Submitted to the Bernoulli 



Least Squares After Model Selection in 
High-dimensional Sparse Models 

Alexandre Belloni and Victor Chernozhukov 

In this paper we study post-model selection estimators which apply ordinary least squares 
(ols) to the model selected by first-step penalized estimators, typically lasso. It is well known 
that lasso can estimate the nonparametric regression function at nearly the oracle rate, and 
is thus hard to improve upon. We show that the ols post lasso estimator performs at least as 
well as lasso in terms of the rate of convergence, and has the advantage of a smaller bias. 
Remarkably, this performance occurs even if the lasso-based model selection "fails" in the sense 
of missing some components of the "true" regression model. By the "true" model we mean 
here the best s-dimensional approximation to the nonparametric regression function chosen 
by the oracle. Furthermore, the ols post lasso estimator can perform strictly better than lasso, in 
the sense of a strictly faster rate of convergence, if the lasso-based model selection correctly 
includes all components of the "true" model as a subset and also achieves sufficient sparsity. 
In the extreme case, when lasso perfectly selects the "true" model, the ols post lasso estimator 
becomes the oracle estimator. An important ingredient in our analysis is a new sparsity bound 
on the dimension of the model selected by lasso which guarantees that this dimension is at most 
of the same order as the dimension of the "true" model. Our rate results are non-asymptotic 
and hold in both parametric and nonparametric models. Moreover, our analysis is not limited 
to the lasso estimator acting as selector in the first step, but also applies to any other estimator, 
for example various forms of thresholded lasso, with good rates and good sparsity properties. 
Our analysis covers both traditional thresholding and a new practical, data-driven thresholding 
scheme that induces maximal sparsity subject to maintaining a certain goodness-of-fit. The latter 
scheme has theoretical guarantees similar to those of lasso or ols post lasso, but it dominates 
these procedures as well as traditional thresholding in a wide variety of experiments. 

1. Introduction 



In this work we study post-model selection estimators for linear regression in 
high-dimensional sparse models (hdsms). In such models, the overall number of regressors p is 
very large, possibly much larger than the sample size n. However, there are s = o(n) 
regressors that capture most of the impact of all covariates on the response variable. 
hdsms ([9], [22]) have emerged to deal with many new applications arising in biometrics, 
signal processing, machine learning, econometrics, and other areas of data analysis where 
high-dimensional data sets have become widely available. 

Several papers have begun to investigate estimation of hdsms, primarily focusing on 
mean regression with the $\ell_1$-norm acting as a penalty function [4, 6, 7, 8, 9, 17, 22, 28, 
31, 33]. The results in [4, 6, 7, 8, 17, 22, 31, 33] demonstrated the fundamental result that 
$\ell_1$-penalized least squares estimators achieve the rate $\sqrt{s/n}\sqrt{\log p}$, which is very 
close to the oracle rate $\sqrt{s/n}$ achievable when the true model is known. The works [17, 28] 
demonstrated a similar fundamental result on the excess forecasting error loss under 
both quadratic and non-quadratic loss functions. Thus the estimator can be consistent 
and can have excellent forecasting performance even under very rapid, nearly exponential 
growth of the total number of regressors p. Also, [2] investigated the $\ell_1$-penalized quantile 
regression process, obtaining similar results. See [4, 6, 7, 8, 15, 19, 20, 24] for many other 
interesting developments and a detailed review of the existing literature. 

In this paper we derive theoretical properties of post-model selection estimators which 
apply ordinary least squares (ols) to the model selected by first-step penalized estimators, 
typically lasso. It is well known that lasso can estimate the mean regression function at 
nearly the oracle rate, and hence is hard to improve upon. We show that ols post lasso can 
perform at least as well as lasso in terms of the rate of convergence, and has the advantage 
of a smaller bias. This nice performance occurs even if the lasso-based model selection 
"fails" in the sense of missing some components of the "true" regression model. Here 
by the "true" model we mean the best s-dimensional approximation to the regression 
function chosen by the oracle. The intuition for this result is that lasso-based model 
selection omits only those components with relatively small coefficients. Furthermore, 
ols post lasso can perform strictly better than lasso, in the sense of a strictly faster 
rate of convergence, if the lasso-based model correctly includes all components of the 
"true" model as a subset and is sufficiently sparse. Of course, in the extreme case, when 
lasso perfectly selects the "true" model, the ols post lasso estimator becomes the oracle 
estimator. 

Importantly, our rate analysis is not limited to the lasso estimator in the first step, but 
applies to a wide variety of other first-step estimators, including, for example, thresholded 
lasso, the Dantzig selector, and their various modifications. We give generic rate results 
that cover any first-step estimator for which a rate and a sparsity bound are available. 
We also give a generic result on using thresholded lasso as the first-step estimator, where 
thresholding can be performed by a traditional thresholding scheme (t-lasso) or by a new 
fitness-thresholding scheme we introduce in the paper (fit-lasso). The new thresholding 
scheme induces maximal sparsity subject to maintaining a certain goodness-of-fit in the 
sample, and is completely data-driven. We show that the ols post fit-lasso estimator performs 
at least as well as the lasso estimator, but can be strictly better under good model 
selection properties. 

Finally, we conduct a series of computational experiments and find that the results 
confirm our theoretical findings. Figure 1 is a brief graphical summary of our theoretical 
results showing how the empirical risk of various estimators changes with the signal 
strength C (coefficients of relevant covariates are set equal to C). For very low levels of 
signal, all estimators perform similarly. When the signal strength is intermediate, ols post 
lasso and ols post fit-lasso substantially outperform lasso and the ols post t-lasso 
estimators. However, we find that ols post fit-lasso outperforms ols post lasso whenever 
lasso does not produce very sparse solutions, which occurs if the signal strength level is 
not low. For large levels of signal, ols post fit-lasso and ols post t-lasso perform very well, 
improving upon lasso and ols post lasso. Thus, the main message here is that ols post 
lasso and ols post fit-lasso perform at least as well as lasso and sometimes a lot better. 

Figure 1. This figure plots the performance of the estimators listed in the text under the 
equi-correlated design for the covariates $x_i \sim N(0, \Sigma)$, $\Sigma_{jk} = 1/2$ if $j \neq k$. The 
number of regressors is p = 500 and the sample size is n = 100 with 1000 simulations for each 
level of signal strength C. In each simulation there are 5 relevant covariates whose 
coefficients are set equal to the signal strength C, and the variance of the noise is set to 1. 
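
The design described in the caption of Figure 1 can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' simulation code: the function name `simulate_design` is hypothetical, the diagonal of $\Sigma$ is assumed to equal 1, the relevant covariates are assumed to be the first five, and the equi-correlation is induced by a shared Gaussian factor.

```python
import math
import random


def simulate_design(n, p, signal, n_relevant=5, rho=0.5, noise_sd=1.0, seed=0):
    """Draw (X, y) with equi-correlated Gaussian covariates and a sparse signal.

    Assumptions (not stated in the text): unit variances on the diagonal of
    Sigma, and the relevant covariates are the first `n_relevant` columns.
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        common = rng.gauss(0.0, 1.0)  # shared factor: gives correlation rho
        row = [
            math.sqrt(rho) * common + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for _ in range(p)
        ]
        mean = signal * sum(row[:n_relevant])  # relevant coefficients equal `signal`
        X.append(row)
        y.append(mean + rng.gauss(0.0, noise_sd))
    return X, y


# One draw with the dimensions quoted in the caption (p = 500, n = 100).
X, y = simulate_design(n=100, p=500, signal=1.0)
```

Each covariate has variance rho + (1 - rho) = 1 and each pair has covariance rho, which matches $\Sigma_{jk} = 1/2$ for $j \neq k$ when rho = 0.5.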



To the best of our knowledge, our paper is the first to establish the aforementioned 
rate results on ols post lasso and the proposed ols post fitness-thresholded lasso in the 
mean regression problem. Our analysis builds upon the ideas in [2], who established the 
properties of post-penalized procedures for the related, but different, problem of median 
regression. Our analysis also builds on the fundamental results of [4] and the other works 
cited above that established the properties of the first-step lasso-type estimators. An 
important ingredient in our analysis is a new sparsity bound on the dimension of the 
model selected by lasso, which guarantees that this dimension is at most of the same 
order as the dimension of the "true" model. This result builds on some inequalities 
for sparse eigenvalues and reasoning previously given in [2] in the context of median 
regression. Our sparsity bounds for lasso improve upon the analogous bounds in [4] and 
are comparable to the bounds in [33] obtained under a larger penalty level. We also rely 
on maximal inequalities in [33] to provide primitive conditions for the sharp sparsity 
bounds to hold. 

We organize the paper as follows. Section 2 reviews the model and discusses the 
estimators. Section 3 revisits some benchmark results of [4] for lasso, albeit allowing for 
a data driven choice of penalty level, develops an extension of model selection results 
of [19] to the nonparametric case, and derives a new sparsity bound for lasso. Section 4 
presents a generic rate result on ols post-model selection estimators. Section 5 applies 
the generic results to the ols post lasso and the ols post thresholded lasso estimators. 
The Appendix contains the main proofs and the Supplementary Appendix contains auxiliary 
proofs. In the Supplementary Appendix we also present the results of our computational 
experiments. 

Notation. In making asymptotic statements, we assume that $n \to \infty$ and $p = p_n \to \infty$, 
and we also allow for $s = s_n \to \infty$. In what follows, all parameter values are indexed 
by the sample size n, but we omit the index whenever this does not cause confusion. We 
use the notation $(a)_+ = \max\{a, 0\}$, $a \vee b = \max\{a, b\}$ and $a \wedge b = \min\{a, b\}$. The 
$\ell_2$-norm is denoted by $\|\cdot\|$, the $\ell_1$-norm is denoted by $\|\cdot\|_1$, the $\ell_\infty$-norm is 
denoted by $\|\cdot\|_\infty$, and the $\ell_0$-norm $\|\cdot\|_0$ denotes the number of non-zero components 
of a vector. Given a vector $\delta \in \mathbb{R}^p$ and a set of indices $T \subset \{1, \ldots, p\}$, we denote 
by $\delta_T$ the vector in which $\delta_{Tj} = \delta_j$ if $j \in T$, $\delta_{Tj} = 0$ if $j \notin T$, and by $|T|$ the 
cardinality of T. Given a covariate vector $x_i \in \mathbb{R}^p$, we denote by $x_i[T]$ the vector 
$\{x_{ij}, j \in T\}$. The symbol $E[\cdot]$ denotes the expectation. We also use standard empirical 
process notation 

$$ \mathbb{E}_n[f(z_\bullet)] := \sum_{i=1}^{n} f(z_i)/n \quad \text{and} \quad \mathbb{G}_n(f(z_\bullet)) := \sum_{i=1}^{n} \big(f(z_i) - E[f(z_i)]\big)/\sqrt{n}. $$

We also denote the $L^2(\mathbb{P}_n)$-norm by $\|f\|_{\mathbb{P}_n,2} = (\mathbb{E}_n[f_\bullet^2])^{1/2}$. Given covariate values 
$x_1, \ldots, x_n$, we define the prediction norm of a vector $\delta \in \mathbb{R}^p$ as 
$\|\delta\|_{2,n} = \{\mathbb{E}_n[(x_\bullet'\delta)^2]\}^{1/2}$. We use the notation $a \lesssim b$ to denote $a \le Cb$ for some 
constant $C > 0$ that does not depend on n (and therefore does not depend on quantities 
indexed by n like p or s); and $a \lesssim_P b$ to denote $a = O_P(b)$. For an event A, we say that 
A wp $\to$ 1 when A occurs with probability approaching one as n grows. Also we denote by 
$\bar c = (c + 1)/(c - 1)$ for a chosen constant $c > 1$. 

2. The setting, estimators, and conditions 
2.1. The setting 

Condition (M). We have data $\{(y_i, z_i), i = 1, \ldots, n\}$ such that for each n 

$$ y_i = f(z_i) + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, n, \tag{2.1} $$

where $y_i$ are the outcomes, $z_i$ are vectors of fixed regressors, and $\epsilon_i$ are i.i.d. errors. Let 
$P(z_i)$ be a given p-dimensional dictionary of technical regressors with respect to $z_i$, i.e. a 
p-vector of transformations of $z_i$, with components 

$$ x_i := P(z_i) $$

imsart-bj ver. 2009/08/13 file: Post-LASS0-SecondRevision_AfterSubmitted_v01.tex date: August 26, 2011 






of the dictionary normalized so that 

$$ \mathbb{E}_n[x_{\bullet j}^2] = 1 \quad \text{for } j = 1, \ldots, p. $$

In making asymptotic statements, we assume that $n \to \infty$ and $p = p_n \to \infty$, and 
that all parameters of the model are implicitly indexed by n. 

We would like to estimate the nonparametric regression function f at the design points, 
namely the values $f_i = f(z_i)$ for $i = 1, \ldots, n$. In order to set up estimation and define a 
performance benchmark we consider the following oracle risk minimization program: 

$$ \min_{0 \le k \le p \wedge n} c_k^2 + \sigma^2 \frac{k}{n}, \tag{2.2} $$

where 

$$ c_k^2 := \min_{\|\beta\|_0 \le k} \mathbb{E}_n[(f_\bullet - x_\bullet'\beta)^2]. \tag{2.3} $$

Note that $c_k^2 + \sigma^2 k/n$ is an upper bound on the risk of the best k-sparse least squares 
estimator, i.e. the best estimator amongst all least squares estimators that use k out of 
p components of $x_i$ to estimate $f_i$, for $i = 1, \ldots, n$. The oracle program (2.2) chooses the 
optimal value of k. Let s be the smallest integer amongst these optimal values, and let 

$$ \beta_0 \in \arg\min_{\|\beta\|_0 \le s} \mathbb{E}_n[(f_\bullet - x_\bullet'\beta)^2]. \tag{2.4} $$

We call $\beta_0$ the oracle target value, $T := \mathrm{support}(\beta_0)$ the oracle model, $s := |T| = \|\beta_0\|_0$ 
the dimension of the oracle model, and $x_i'\beta_0$ the oracle approximation to $f_i$. The latter is 
our intermediary target, which is equal to the ultimate target $f_i$ up to the approximation 
error 

$$ r_i := f_i - x_i'\beta_0. $$

If we knew T we could simply use $x_i[T]$ as regressors and estimate $f_i$, for $i = 1, \ldots, n$, 
using the least squares estimator, achieving the risk of at most 

$$ c_s^2 + \sigma^2 s/n, $$

which we call the oracle risk. Since T is not known, we shall estimate T using lasso-type 
methods and analyze the properties of post-model selection least squares estimators, 
accounting for possible model selection mistakes. 

Remark 2.1 (The oracle program). Note that if argmin is not unique in the prob- 
lem (2.4), it suffices to select one of the values in the set of argmins. Supplementary 
Appendix provides a more detailed discussion of the oracle problem. The idea of using 
oracle problems such as (2.2) for benchmarking the performance follows its prior uses in 
the literature. For instance, see [4], Theorem 6.1, where an analogous problem appears 
in upper bounds on performance of lasso. □ 

Remark 2.2 (A leading special case). When contrasting the performance of lasso and 
ols post lasso estimators in Remarks 5.1-5.2 given later, we shall mention a balanced case 
where 

$$ c_s^2 \lesssim \sigma^2 s/n, \tag{2.5} $$

which says that the oracle program (2.2) is able to balance the norm of the bias squared 
to be not much larger than the variance term $\sigma^2 s/n$. This corresponds to the case that 
the approximation error bias does not dominate the estimation error of the oracle least 
squares estimator, so that the oracle rate of convergence simplifies to $\sqrt{s/n}$ mentioned 
in the introduction. 



2.2. Model selectors based on lasso 

Given the large number of regressors p > n, some regularization or covariate selection 
is required in order to obtain consistency. The lasso estimator [26], defined as follows, 
achieves both tasks by using the $\ell_1$ penalization: 

$$ \widehat\beta \in \arg\min_{\beta} \widehat Q(\beta) + \frac{\lambda}{n}\|\beta\|_1, \quad \text{where } \widehat Q(\beta) = \mathbb{E}_n[(y_\bullet - x_\bullet'\beta)^2], \tag{2.6} $$

and $\lambda$ is the penalty level whose choice is described below. If the solution is not unique 
we pick any solution with minimum support. The lasso is often used as an estimator and 
more often only as a model selection device, with the model selected by lasso given by: 

$$ \widehat T := \mathrm{support}(\widehat\beta). $$

Moreover, we denote by $\widehat m := |\widehat T \setminus T|$ the number of components outside T selected by 
lasso and by $\widehat f_i = x_i'\widehat\beta$, $i = 1, \ldots, n$, the lasso estimate of $f_i$, $i = 1, \ldots, n$. 
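
As an illustration of how a minimizer of (2.6) might be computed, the following is a minimal sketch using cyclic coordinate descent. Coordinate descent is a standard technique chosen here for illustration; the text does not specify an algorithm. The names `soft_threshold` and `lasso_cd` are hypothetical. With the scaling in (2.6), each coordinate update soft-thresholds the partial correlation at level $\lambda/(2n)$.

```python
def soft_threshold(value, level):
    """Shrink `value` toward zero by `level`, returning 0.0 inside [-level, level]."""
    if value > level:
        return value - level
    if value < -level:
        return value + level
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=200, tol=1e-10):
    """Cyclic coordinate descent for E_n[(y - x'b)^2] + (lam / n) * ||b||_1.

    X is a list of n rows (each a list of p floats); y is a list of n floats.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # current residual y - X beta
    col_ms = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            if col_ms[j] == 0.0:
                continue  # an all-zero column cannot enter the model
            # E_n[x_j * (partial residual)], where the partial residual
            # adds the contribution of coordinate j back to the residual.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_ms[j] * beta[j]
            new = soft_threshold(rho, lam / (2.0 * n)) / col_ms[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta
```

The minimum-support tie-breaking rule mentioned in the text is not implemented in this sketch; when the minimizer is not unique, the sketch returns whichever solution the iteration reaches.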

Oftentimes additional thresholding is applied to remove regressors with small estimated 
coefficients, defining the so-called thresholded lasso estimator: 

$$ \widehat\beta(t) = \big(\widehat\beta_j 1\{|\widehat\beta_j| > t\}, \ j = 1, \ldots, p\big), \tag{2.7} $$

where $t \ge 0$ is the thresholding level, and the corresponding selected model is then 

$$ \widehat T(t) := \mathrm{support}(\widehat\beta(t)). $$

Note that setting t = 0, we have $\widehat T(t) = \widehat T$, so lasso is a special case of thresholded lasso. 
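
The hard-thresholding rule (2.7) and the resulting selected model translate directly into code. The function names below are hypothetical and the coefficient values are made up for illustration.

```python
def threshold(beta_hat, t):
    """Apply (2.7): keep coefficient j only if |beta_hat[j]| > t."""
    return [b if abs(b) > t else 0.0 for b in beta_hat]


def support(beta):
    """Indices of the non-zero components (0-based)."""
    return [j for j, b in enumerate(beta) if b != 0.0]


beta_hat = [2.95, -0.15, 0.03, 0.0]         # illustrative lasso output
selected = support(threshold(beta_hat, 0.1))  # model selected at level t = 0.1
```

With t = 0 the thresholded estimator coincides with `beta_hat`, so `support(threshold(beta_hat, 0.0))` returns the lasso model itself.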

2.3. Post-model selection estimators 

Given this, all of our post-model selection estimators or ols post lasso estimators will take 
the form 

$$ \widetilde\beta^t \in \arg\min_{\beta} \widehat Q(\beta) : \ \beta_j = 0 \ \text{for each } j \in \widehat T^c(t). \tag{2.8} $$

That is, given the model $\widehat T(t)$ selected by a thresholded lasso, including the lasso's model 
$\widehat T(0)$ as a special case, the post-model selection estimator applies ordinary least squares 
to the selected model. 
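
The second step, ordinary least squares restricted to the selected model, can be sketched as follows. This is an illustration only: the names `solve_linear_system` and `post_selection_ols` are hypothetical, and the normal equations are solved by plain Gaussian elimination, which is adequate for small, well-conditioned selected models but is not a numerically robust general solver.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    M = [list(A[r]) + [b[r]] for r in range(k)]  # augmented matrix
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("selected regressors are (nearly) collinear")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, k):
            factor = M[r][col] / M[col][col]
            for c in range(col, k + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * k
    for r in range(k - 1, -1, -1):  # back substitution
        tail = sum(M[r][c] * x[c] for c in range(r + 1, k))
        x[r] = (M[r][k] - tail) / M[r][r]
    return x


def post_selection_ols(X, y, selected):
    """Least squares of y on the columns in `selected`; other coefficients are 0."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    if not selected:
        return beta
    gram = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in selected]
            for a in selected]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in selected]
    for j, coef in zip(selected, solve_linear_system(gram, xty)):
        beta[j] = coef
    return beta
```

Passing the support of the lasso estimate gives ols post lasso; passing the support of a thresholded lasso estimate gives the thresholded variants.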

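In addition to the case of t = 0, we also consider the following choices for the threshold 
level: 

$$ \begin{aligned} \text{traditional threshold (t):} \quad & t > \zeta = \max_{1 \le j \le p} |\widehat\beta_j - \beta_{0j}|, \\ \text{fitness-based threshold (fit):} \quad & t = t_\gamma := \max_{t \ge 0} \{t : \widehat Q(\widetilde\beta^t) - \widehat Q(\widehat\beta) \le \gamma\}, \end{aligned} \tag{2.9} $$

where $\gamma \le 0$, and $|\gamma|$ is the gain of the in-sample fit allowed relative to lasso. 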

As discussed in Section 3.2, the standard thresholding method is particularly appealing 
in models in which oracle coefficients $\beta_0$ are well separated from zero. This scheme 
however may perform poorly in models with oracle coefficients not well separated from 
zero and in nonparametric models. Indeed, even in parametric models with many small 
but non-zero true coefficients, thresholding the estimates too aggressively may result in 
large goodness-of-fit losses, and consequently in slow rates of convergence and even 
inconsistency for the second-step estimators. This issue directly motivates our new 
goodness-of-fit based thresholding method, which sets to zero small coefficient estimates as 
much as possible subject to maintaining a certain goodness-of-fit level. 

Depending on how we select the threshold, we consider the following three types of 
the post-model selection estimators: 

$$ \begin{aligned} \text{ols post lasso:} \quad & \widetilde\beta^{0} \quad (t = 0), \\ \text{ols post t-lasso:} \quad & \widetilde\beta^{t} \quad (t > \zeta), \\ \text{ols post fit-lasso:} \quad & \widetilde\beta^{t_\gamma} \quad (t = t_\gamma). \end{aligned} \tag{2.10} $$

The first estimator is defined by ols applied to the model selected by lasso, also called 
Gauss-lasso; the second by ols applied to the model selected by the thresholded lasso, 
and the third by ols applied to the model selected by fitness-thresholded lasso. 

The main purpose of this paper is to derive the properties of the post-model selection 
estimators (2.10). If model selection works perfectly, which is possible only under rather 
special circumstances, then the post-model selection estimators are the oracle estimators, 
whose properties are well known. However, of a much more general interest is the case 
when model selection does not work perfectly, as occurs for many designs of interest in 
applications. 

2.4. Choice and computation of penalty level for lasso 

The key quantity in the analysis is the gradient of $\widehat Q$ at the true value: 

$$ S = 2\mathbb{E}_n[x_\bullet \epsilon_\bullet]. $$

This gradient is the effective "noise" in the problem that should be dominated by the 
regularization. However, we would like to make the bias as small as possible. This reasoning 
suggests choosing the smallest penalty level $\lambda$ that dominates the noise, namely 

$$ \lambda \ge cn\|S\|_\infty \quad \text{with probability at least } 1 - \alpha, \tag{2.11} $$

where the probability $1 - \alpha$ needs to be close to 1 and $c > 1$. Therefore, we propose setting 

$$ \lambda = c'\,\widehat\sigma\,\Lambda(1 - \alpha|X), \quad \text{for some fixed } c' > c > 1, \tag{2.12} $$

where $\Lambda(1 - \alpha|X)$ is the $(1 - \alpha)$-quantile of $n\|S/\sigma\|_\infty$, and $\widehat\sigma$ is a possibly data-driven 
estimate of $\sigma$. Note that the quantity $\Lambda(1 - \alpha|X)$ is independent of $\sigma$ and can be easily 
approximated by simulation. We refer to this choice of $\lambda$ as the data-driven choice, 
reflecting the dependence of the choice on the design matrix $X = [x_1, \ldots, x_n]'$ and a 
possibly data-driven $\widehat\sigma$. Note that the proposed (2.12) is sharper than 
$c'\widehat\sigma\, 2\sqrt{2n\log(p/\alpha)}$ typically used in the literature. We impose the following conditions 
on $\widehat\sigma$. 

Condition (V). The estimated $\widehat\sigma$ obeys 

$$ \ell \le \widehat\sigma/\sigma \le u \quad \text{with probability at least } 1 - \tau, $$

where $0 < \ell \le 1$, $1 \le u$ and $0 \le \tau < 1$ are constants possibly dependent on n. 

We can construct a $\widehat\sigma$ that satisfies this condition under mild assumptions as follows. 
First, set $\widehat\sigma = \bar\sigma_0$, where $\bar\sigma_0$ is an upper bound on $\sigma$ which is possibly data-driven, for 
example the sample standard deviation of $y_i$. Second, compute the lasso estimator based 
on this estimate and set $\widehat\sigma^2 = \widehat Q(\widehat\beta)$. We demonstrate that $\widehat\sigma$ constructed in this way 
satisfies Condition V and characterize the quantities u, $\ell$ and $\tau$ in the Supplementary 
Appendix. We can iterate on the last step a bounded number of times. Moreover, we can 
similarly use ols post lasso for this purpose. 
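
The text notes that the quantile $\Lambda(1 - \alpha|X)$ in (2.12) can be approximated by simulation. A minimal sketch of one way to do this follows: since the errors are Gaussian under Condition M, $n\|S/\sigma\|_\infty$ has the same law as $2\max_j |\sum_i x_{ij} g_i|$ with $g_i$ i.i.d. N(0, 1), so it can be simulated conditional on X. The function names, the number of draws, and the default $c' = 1.1$ are assumptions made for illustration and are not taken from the text.

```python
import math
import random


def simulate_penalty_quantile(X, alpha=0.05, n_draws=1000, seed=0):
    """Empirical (1 - alpha)-quantile of n * ||S / sigma||_inf given X."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    stats = []
    for _ in range(n_draws):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]  # stands in for eps / sigma
        # n * ||2 E_n[x g]||_inf = 2 * max_j |sum_i x_ij g_i|
        stats.append(
            2.0 * max(abs(sum(X[i][j] * g[i] for i in range(n))) for j in range(p))
        )
    stats.sort()
    index = min(n_draws - 1, math.ceil((1.0 - alpha) * n_draws) - 1)
    return stats[index]


def penalty_level(X, sigma_hat, c_prime=1.1, alpha=0.05, n_draws=1000, seed=0):
    """Data-driven penalty (2.12); c_prime = 1.1 is an illustrative default."""
    quantile = simulate_penalty_quantile(X, alpha=alpha, n_draws=n_draws, seed=seed)
    return c_prime * sigma_hat * quantile
```

The estimate `sigma_hat` would come from the two-step construction described above (an initial upper bound followed by a lasso-based update); that loop is omitted here.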

2.5. Choices and computation of thresholding levels 

Our analysis will cover a wide range of possible threshold levels. Here, however, we would 
like to propose some basic options that give both good finite-sample and theoretical 
results. In the traditional thresholding method, we can set 

$$ t = \tilde c\,\lambda/n, \tag{2.13} $$

for some $\tilde c \ge 1$. This choice is theoretically motivated by Section 3.2 that presents the 
perfect model selection results, where under some conditions $\zeta \le \tilde c\lambda/n$. This choice 
also leads to near-oracle performance of the resulting post-model selection estimator. 
Regarding the choice of $\tilde c$, we note that setting $\tilde c = 1$ and achieving $\zeta \le \lambda/n$ is possible 
by the results of Section 3.2 if the empirical Gram matrix is orthogonal and the approximation 
error $c_s$ vanishes. Thus, $\tilde c = 1$ is the least aggressive traditional thresholding one can 
perform under the conditions of Section 3.2 (note also that $\tilde c = 1$ has performed better than 
$\tilde c > 1$ in our computational experiments). 

Our fitness-based threshold $t_\gamma$ requires the specification of the parameter $\gamma$. The 
simplest choice delivering near-oracle performance is $\gamma = 0$; this choice leads to the sparsest 
post-model selection estimator that has the same in-sample fit as lasso. Our preferred 
choice is however to set 

$$ \gamma = \frac{\widehat Q(\widetilde\beta^0) - \widehat Q(\widehat\beta)}{2} < 0, \tag{2.14} $$

where $\widetilde\beta^0$ is the ols post lasso estimator. The resulting estimator is more sparse than lasso, 
and it also produces a better in-sample fit than lasso. This choice also results in near-oracle 
performance and also leads to the best performance in computational experiments. 
Note also that for any $\gamma$, we can compute $t_\gamma$ by a binary search over 
$t \in \mathrm{sort}\{|\widehat\beta_j|, j \in \widehat T\}$, where sort is the sorting operator. This is the case since the final 
estimator depends only on the selected support and not on the specific value of t used. 
Therefore, since there are at most $|\widehat T|$ different values of t to be tested, by using a binary 
search, we can compute $t_\gamma$ exactly by running at most $\lceil \log_2 |\widehat T| \rceil$ ordinary least squares 
problems. 
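
The binary search described above can be sketched as follows. Because the final estimator depends only on the selected support, the sketch searches over the number k of largest lasso coefficients retained rather than over t itself, and returns the smallest feasible support. The names `fit_lasso_support`, `ols_on_support` and `avg_sq_loss` are hypothetical, ties among the $|\widehat\beta_j|$ are assumed away, and the small linear solver is included only to keep the example self-contained.

```python
def solve_linear_system(A, b):
    """Gaussian elimination with partial pivoting (small systems only)."""
    k = len(b)
    M = [list(A[r]) + [b[r]] for r in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("selected regressors are (nearly) collinear")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, k):
            factor = M[r][col] / M[col][col]
            for c in range(col, k + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * k
    for r in range(k - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, k))
        x[r] = (M[r][k] - tail) / M[r][r]
    return x


def ols_on_support(X, y, selected):
    """Least squares on the selected columns; zero elsewhere."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    if selected:
        gram = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in selected]
                for a in selected]
        xty = [sum(X[i][a] * y[i] for i in range(n)) for a in selected]
        for j, coef in zip(selected, solve_linear_system(gram, xty)):
            beta[j] = coef
    return beta


def avg_sq_loss(X, y, beta):
    """Q(beta) = E_n[(y - x'beta)^2]."""
    n = len(X)
    return sum((y[i] - sum(x * b for x, b in zip(X[i], beta))) ** 2
               for i in range(n)) / n


def fit_lasso_support(X, y, beta_hat, gamma):
    """Smallest top-k support whose ols fit satisfies Q(ols) - Q(lasso) <= gamma."""
    order = sorted((j for j, b in enumerate(beta_hat) if b != 0.0),
                   key=lambda j: -abs(beta_hat[j]))
    target = avg_sq_loss(X, y, beta_hat) + gamma

    def feasible(k):
        return avg_sq_loss(X, y, ols_on_support(X, y, order[:k])) <= target

    lo, hi = 0, len(order)
    if not feasible(hi):
        raise ValueError("gamma too negative: ols post lasso misses the target fit")
    while lo < hi:  # the ols loss is non-increasing in k, so bisection applies
        mid = (lo + hi) // 2
        if feasible(mid):
            hi = mid
        else:
            lo = mid + 1
    return sorted(order[:lo])
```

The bisection is valid because the candidate supports are nested, so the least squares fit can only improve as k grows. Setting `gamma` as in (2.14) requires one additional ols fit on the full lasso support.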

2.6. Conditions on the design 

For the analysis of lasso we rely on the following restricted eigenvalue condition. 

Condition (RE($\bar c$)). For a given $\bar c \ge 0$, 

$$ \kappa(\bar c) := \min_{\|\delta_{T^c}\|_1 \le \bar c\|\delta_T\|_1,\ \delta \ne 0} \frac{\sqrt{s}\,\|\delta\|_{2,n}}{\|\delta_T\|_1} > 0. $$

This condition is a variant of the restricted eigenvalue condition introduced in [4], which 
is known to be quite general and plausible; see also [4] for related conditions. 

For the analysis of post-model selection estimators we need the following restricted 
sparse eigenvalue condition. 

Condition (RSE(m)). For a given m < n, 

$$ \widetilde\kappa(m)^2 := \min_{\|\delta_{T^c}\|_0 \le m,\ \delta \ne 0} \frac{\|\delta\|_{2,n}^2}{\|\delta\|^2} > 0, \qquad \phi(m) := \max_{\|\delta_{T^c}\|_0 \le m,\ \delta \ne 0} \frac{\|\delta\|_{2,n}^2}{\|\delta\|^2} > 0. $$

Here m denotes the restriction on the number of non-zero components outside the 
support T. It will be convenient to define the following condition number associated with 
the empirical Gram matrix: 

$$ \mu(m) = \frac{\sqrt{\phi(m)}}{\widetilde\kappa(m)}. \tag{2.15} $$

The following lemma demonstrates the plausibility of the conditions above for the case 
where the values $x_i$, $i = 1, \ldots, n$, have been generated as a realization of a random 
sample; there are also other primitive conditions. In this case we can expect that the 
empirical restricted eigenvalue is actually bounded away from zero and (2.15) is bounded 
from above with high probability. The lemma uses the standard concept of (unrestricted) 
sparse eigenvalues (see, e.g. [4]) to state a primitive condition on the population 
Gram matrix. The lemma allows for standard arbitrary bounded dictionaries, arising in 
nonparametric estimation, for example regression splines, orthogonal polynomials, 
and trigonometric series, see [14, 29, 32, 27]. Similar results are known to also hold for 
standard Gaussian regressors [33]. 

Lemma 1 (Plausibility of RE and RSE). Suppose $\tilde x_i$, $i = 1, \ldots, n$, are i.i.d. zero-mean 
vectors, such that the population design matrix $E[\tilde x_i \tilde x_i']$ has ones on the diagonal, and its 
$s \log n$-sparse eigenvalues are bounded from above by $\varphi < \infty$ and bounded from below by 
$\kappa^2 > 0$. Define $x_i$ as a normalized form of $\tilde x_i$, namely $x_{ij} = \tilde x_{ij}/(\mathbb{E}_n[\tilde x_{\bullet j}^2])^{1/2}$. Suppose that 
$\max_{1 \le i \le n} \|\tilde x_i\|_\infty \le K_n$ a.s., and $K_n^2 s \log^2(n) \log^2(s \log n) \log(p \vee n) = o(n\kappa^4/\varphi)$. Then, 
for any $m + s \le s \log n$, the empirical restricted sparse eigenvalues obey the following 
bounds: 

$$ \phi(m) \le 4\varphi, \quad \widetilde\kappa(m)^2 \ge \kappa^2/4, \quad \text{and} \quad \mu(m) \le 4\sqrt{\varphi}/\kappa, $$

with probability approaching 1 as $n \to \infty$. 



3. Results on lasso as an estimator and model 
selector 

The properties of the post-model selection estimators will crucially depend on both the 
estimation and model selection properties of lasso. In this section we develop the 
estimation properties of lasso under the data-dependent penalty level, extending the results 
of [4], and develop the model selection properties of lasso for nonparametric models, 
generalizing the results of [19] to the nonparametric case. 

3.1. Estimation Properties of lasso 

The following theorem describes the main estimation properties of lasso under the 
data-driven choice of the penalty level. 

Theorem 1 (Performance bounds for lasso under data-driven penalty). Suppose that 
Conditions M and RE($\bar c$) hold for $\bar c = (c + 1)/(c - 1)$. If $\lambda \ge cn\|S\|_\infty$, then 

$$ \|\widehat\beta - \beta_0\|_{2,n} \le \left(1 + \frac{1}{c}\right) \frac{\lambda\sqrt{s}}{n\kappa(\bar c)} + 2c_s. $$

Moreover, suppose that Condition V holds. Under the data-driven choice (2.12), for 
$c' \ge c/\ell$, we have $\lambda \ge cn\|S\|_\infty$ with probability at least $1 - \alpha - \tau$, so that with at least the 
same probability 

$$ \|\widehat\beta - \beta_0\|_{2,n} \le (c' + c'/c) \frac{\sqrt{s}}{n\kappa(\bar c)}\, \sigma u\, \Lambda(1 - \alpha|X) + 2c_s, \quad \text{where } \Lambda(1 - \alpha|X) \le \sqrt{2n\log(p/\alpha)}. $$

If further RE($2\bar c$) holds, then 



ll^-Alli < ^"""^H v U 1 + W ^-Ta c : 

This theorem extends the result of [4] by allowing for data-driven penalty level and 
deriving the rates in $\ell_1$-norm. These results may be of independent interest and are 
needed for subsequent results. 

Remark 3.1. Furthermore, a performance bound for the estimation of the regression 
function follows from the relation 

$$ \big|\, \|\widehat f - f\|_{\mathbb{P}_n,2} - \|\widehat\beta - \beta_0\|_{2,n} \big| \le c_s, \tag{3.16} $$

where $\widehat f_i = x_i'\widehat\beta$ is the lasso estimate of the regression function f evaluated at $z_i$. It is 
interesting to know some lower bounds on the rate which follow from the 
Karush-Kuhn-Tucker conditions for lasso (see equation (A.25) in the appendix): 

$$ \|\widehat f - f\|_{\mathbb{P}_n,2} \ge \frac{(1 - 1/c)\,\lambda \sqrt{|\widehat T|}}{2n\sqrt{\phi(\widehat m)}}, $$

where $\widehat m = |\widehat T \setminus T|$. We note that a similar lower bound was first derived in [21] with $\phi(p)$ 
instead of $\phi(\widehat m)$. □ 

The preceding theorem and discussion imply the following useful asymptotic bound 
on the performance of the estimators. 

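Corollary 1 (Asymptotic bounds on performance of lasso). Under the conditions of 
Theorem 1, if 

$$ \phi(\widehat m) \lesssim 1, \ \kappa(\bar c) \gtrsim 1, \ \mu(\widehat m) \lesssim 1, \ \log(1/\alpha) \lesssim \log p, \ \alpha = o(1), \ u/\ell \lesssim 1, \ \text{and} \ \tau = o(1) \tag{3.17} $$

hold as n grows, we have that 

$$ \|\widehat f - f\|_{\mathbb{P}_n,2} \lesssim_P \sigma\sqrt{\frac{s \log p}{n}} + c_s. $$

Moreover, if $|\widehat T| \gtrsim_P s$, in particular if $T \subseteq \widehat T$ with probability going to 1, we have 

$$ \|\widehat f - f\|_{\mathbb{P}_n,2} \gtrsim_P \sigma\sqrt{\frac{s \log p}{n}}. $$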

In Lemma 1 we established fairly general sufficient conditions for the first three 
relations in (3.17) to hold with high probability as n grows, when the design points 
$z_1, \ldots, z_n$ were generated as a random sample. The remaining relations are mild conditions 
on the choice of $\alpha$ and the estimation of $\sigma$ that are used in the definition of the data-driven 
choice (2.12) of the penalty level $\lambda$. 

It follows from the corollary that provided $\kappa(\bar c)$ is bounded away from zero, lasso with 
data-driven penalty estimates the regression function at a near-oracle rate. The second 
part of the corollary generalizes to the nonparametric case the lower bound obtained for 
lasso in [21]. It shows that the rate cannot be improved in general. We shall use the 
asymptotic rates of convergence to compare the performance of lasso and the post-model 
selection estimators. 

3.2. Model selection properties of lasso 

The main results of the paper do not require the first-step estimators like lasso to perfectly 
select the "true" oracle model. In fact, we are specifically interested in the most common 
cases where these estimators do not perfectly select the true model. For these cases, we 
will prove that post-model selection estimators such as ols post lasso achieve near-oracle 
rates like those of lasso. However, in some special cases, where perfect model selection is 
possible, these estimators can achieve the exact oracle rates, and thus can be even better 
than lasso. The purpose of this section is to describe these very special cases where perfect 
model selection is possible. 

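Theorem 2 (Some conditions for perfect model selection in nonparametric setting). 
Suppose that Condition M holds. (1) If the coefficients are well separated from zero, that 
is 

$$ \min_{j \in T} |\beta_{0j}| > \zeta + t, \quad \text{for some } t \ge \zeta := \max_{j = 1, \ldots, p} |\widehat\beta_j - \beta_{0j}|, $$

then the true model is a subset of the selected model, $T := \mathrm{support}(\beta_0) \subseteq \widehat T := \mathrm{support}(\widehat\beta)$. 
Moreover, T can be perfectly selected by applying level t thresholding to $\widehat\beta$, i.e. $T = \widehat T(t)$. 
(2) In particular, if $\lambda \ge cn\|S\|_\infty$, and there is a constant $U > 5\bar c$ such that the empirical 
Gram matrix satisfies $|\mathbb{E}_n[x_{\bullet j} x_{\bullet k}]| \le 1/(Us)$ for all $1 \le j < k \le p$, then 

$$ \zeta \le \frac{\lambda}{n} \cdot \frac{U + \bar c}{U - 5\bar c} + \left(\frac{\sigma}{\sqrt{n}} \wedge c_s\right) + \frac{6\bar c}{U - 5\bar c} \cdot \frac{c_s}{\sqrt{s}} + \frac{4\bar c}{U} \cdot \frac{n}{\lambda} \cdot \frac{c_s^2}{s}. $$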

These results substantively generalize the parametric results of [19] on model selection 
by thresholded lasso. These results cover the more general nonparametric case and may 
be of independent interest. Note also that the conditions for perfect model selection 
stated require a strong assumption on the separation of coefficients of the oracle from 
zero, and also a near perfect orthogonality of the empirical Gram matrix. This is the 
sense in which the perfect model selection is a rather special, non-general phenomenon. 
Finally, we note that it is possible to perform perfect selection of the oracle model by lasso 
without applying any additional thresholding under additional technical conditions and 
higher penalty levels [34, 31, 5]. In the supplement we state the nonparametric extension 
of the parametric result due to [31]. 

3.3. Sparsity properties of lasso 

We also derive new sharp sparsity bounds for lasso, which may be of independent interest. 
We begin with a preliminary sparsity bound for lasso. 

Lemma 2 (Empirical pre-sparsity for lasso). Suppose that Conditions M and RE($\bar c$) 
hold, $\lambda \geq c\,n\|S\|_\infty$, and let $\hat m = |\hat T \setminus T|$. We have for $\bar c = (c+1)/(c-1)$ that 

$$\sqrt{\hat m} \leq \sqrt{s}\,\sqrt{\phi(\hat m)}\,\frac{2\bar c}{\kappa(\bar c)} + 3(\bar c+1)\sqrt{\phi(\hat m)}\,\frac{n c_s}{\lambda}.$$

The lemma above states that lasso achieves the oracle sparsity up to a factor of $\phi(\hat m)$. 
Under the conditions (2.5) and $\kappa(\bar c) \gtrsim 1$, the lemma above immediately yields the simple 
upper bound on the sparsity of the form 

$$\hat m \lesssim_P s\,\phi(n), \qquad (3.18)$$

as obtained for example in [4] and [22]. Unfortunately, this bound is sharp only when 
$\phi(n)$ is bounded. When $\phi(n)$ diverges, for example when $\phi(n) \gtrsim_P \sqrt{\log p}$ in the Gaussian 
design with $p \geq 2n$ by Lemma 6 of [3], the bound is not sharp. However, for this case 
we can construct a sharp sparsity bound by combining the preceding pre-sparsity result 
with the following sub-linearity property of the restricted sparse eigenvalues. 

Lemma 3 (Sub-linearity of restricted sparse eigenvalues). For any integer $k \geq 0$ and 
constant $\ell \geq 1$ we have $\phi(\lceil \ell k \rceil) \leq \lceil \ell \rceil \phi(k)$. 

A version of this lemma for unrestricted sparse eigenvalues has been previously proven 
in [2]. The combination of the preceding two lemmas gives the following sparsity theorem. 

Theorem 3 (Sparsity bound for lasso under data-driven penalty). Suppose that 
Conditions M and RE($\bar c$) hold, and let $\hat m := |\hat T \setminus T|$. The event $\lambda \geq c\,n\|S\|_\infty$ implies that 

$$\hat m \leq s \cdot \Big[\min_{m \in \mathcal{M}} \phi(m \wedge n)\Big] \cdot L_n,$$

where $\mathcal{M} = \{m \in \mathbb{N} : m > s\,\phi(m \wedge n) \cdot 2L_n\}$ and $L_n = [2\bar c/\kappa(\bar c) + 3(\bar c+1)\,n c_s/(\lambda\sqrt{s})]^2$. 

The main implication of Theorem 3 is that under (2.5), if $\min_{m \in \mathcal{M}} \phi(m \wedge n) \lesssim 1$ and 
$\lambda \geq c\,n\|S\|_\infty$ hold with high probability, which is valid by Lemma 1 for important designs 
and by the choice of penalty level (2.12), then with high probability 

$$\hat m \lesssim s. \qquad (3.19)$$

Consequently, for these designs and penalty level, lasso's sparsity is of the same order as 
the oracle sparsity, namely $\hat s := |\hat T| \leq s + \hat m \lesssim s$ with high probability. The reason for this 
is that $\min_{m \in \mathcal{M}} \phi(m) \ll \phi(n)$ for these designs, which allows us to sharpen the previous 
sparsity bound (3.18) considered in [4] and [22]. Also, our new bound is comparable to 
the bounds in [33] in terms of order of sharpness, but it requires a smaller penalty level 
$\lambda$, which also does not depend on the unknown sparse eigenvalues as in [33]. 
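The sub-linearity property in Lemma 3 can be checked numerically on a small design. The sketch below is only an illustration under stated assumptions: it uses the unrestricted maximal sparse eigenvalue of the empirical Gram matrix (the version attributed to [2] above) rather than the restricted one, computes it by brute force over all supports, and the helper name `max_sparse_eigenvalue` and the simulated Gaussian design are ours.

```python
import itertools
import math
import numpy as np

def max_sparse_eigenvalue(gram, k):
    """Largest eigenvalue over all k-by-k principal submatrices (brute force)."""
    p = gram.shape[0]
    best = 0.0
    for idx in itertools.combinations(range(p), k):
        sub = gram[np.ix_(idx, idx)]
        best = max(best, float(np.linalg.eigvalsh(sub)[-1]))
    return best

rng = np.random.default_rng(0)
n, p = 30, 8
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).mean(axis=0))      # normalize so that E_n[x_j^2] = 1
gram = X.T @ X / n

# phi[k]: maximal k-sparse eigenvalue of the empirical Gram matrix
phi = {k: max_sparse_eigenvalue(gram, k) for k in range(1, p + 1)}

# Check phi(ceil(l * k)) <= ceil(l) * phi(k) whenever ceil(l * k) <= p.
violations = []
for k in range(1, p + 1):
    for ell in (1.0, 1.5, 2.0, 3.0):
        big = math.ceil(ell * k)
        if big <= p and phi[big] > math.ceil(ell) * phi[k] + 1e-9:
            violations.append((k, ell))
print(phi, violations)
```

The brute-force search is exponential in $p$ and is meant only for toy dimensions; it is not how sparse eigenvalues would be bounded in the designs covered by Lemma 1.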
4. Performance of post-model selection estimators 
with a generic model selector 

Next, we present a general result on the performance of a post-model selection estimator 
with a generic model selector. 

Theorem 4 (Performance of post-model selection estimator with a generic model 
selector). Suppose Condition M holds and let $\hat\beta$ be any first-step estimator acting as the 
model selector and denote by $\hat T := \mathrm{support}(\hat\beta)$ the model it selects, such that $|\hat T| \leq n$. Let 
$\tilde\beta$ be the post-model selection estimator defined by 

$$\tilde\beta \in \arg\min_{\beta \in \mathbb{R}^p} \hat Q(\beta) : \beta_j = 0 \text{ for each } j \in \hat T^c. \qquad (4.20)$$

Let $B_n := \hat Q(\hat\beta) - \hat Q(\beta_0)$ and $C_n := \hat Q(\beta_{0\hat T}) - \hat Q(\beta_0)$, and let $\hat m = |\hat T \setminus T|$ be the number 
of wrong regressors selected. Then, if condition RSE($\hat m$) holds, for any $\varepsilon > 0$, there is a 
constant $K_\varepsilon$ independent of n such that with probability at least $1 - \varepsilon$, for $\tilde f_i = x_i'\tilde\beta$ we 
have 

$$\|\tilde f - f\|_{P_n,2} \leq K_\varepsilon \sigma \sqrt{\frac{\hat m \log p + (\hat m + s)\log(e\mu(\hat m))}{n}} + 3c_s + \sqrt{(B_n)_+ \wedge (C_n)_+}.$$

Furthermore, for any $\varepsilon > 0$, there is a constant $K_\varepsilon$ independent of n such that with 
probability at least $1 - \varepsilon$, 

$$B_n \leq \|\hat\beta - \beta_0\|_{2,n}^2 + \left[K_\varepsilon \sigma \sqrt{\frac{\hat m \log p + (\hat m + s)\log(e\mu(\hat m))}{n}} + 2c_s\right]\|\hat\beta - \beta_0\|_{2,n},$$

$$C_n \leq 1\{T \not\subseteq \hat T\}\left(\|\beta_{0\hat T^c}\|_{2,n}^2 + \left[K_\varepsilon \sigma \sqrt{\frac{\log\binom{s}{k} + k\log(e\mu(0))}{n}} + 2c_s\right]\|\beta_{0\hat T^c}\|_{2,n}\right),$$

where $k = |T \setminus \hat T|$ is the number of components of $T$ missed by the selected model. 

Three implications of Theorem 4 are worth noting. First, the bounds on the prediction 
norm stated in Theorem 4 apply to the ols estimator on the components selected by any 
first-step estimator $\hat\beta$, provided we can bound both the rate of convergence $\|\hat\beta - \beta_0\|_{2,n}$ 
of the first-step estimator and $\hat m$, the number of wrong regressors selected by the model 
selector. Second, note that if the selected model contains the true model, $T \subseteq \hat T$, then 
we have $(B_n)_+ \wedge (C_n)_+ = C_n = 0$, so $B_n$ does not affect the rate at all, and the 
performance of the second-step estimator is determined by the sparsity $\hat m$ of the 
first-step estimator, which controls the magnitude of the empirical errors. Otherwise, if the 
selected model fails to contain the true model, that is, $T \not\subseteq \hat T$, the performance of the 
second-step estimator is determined by both the sparsity $\hat m$ and the minimum between 
$B_n$ and $C_n$. The quantity $B_n$ measures the in-sample loss-of-fit induced by the first-step 
estimator relative to the "true" parameter value $\beta_0$, and $C_n$ measures the in-sample 
loss-of-fit induced by truncating the "true" parameter $\beta_0$ outside the selected model $\hat T$. 
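The definition (4.20) and the quantities $B_n$ and $C_n$ can be illustrated with a short sketch. Everything below is an assumption made for illustration: the simulated design, the "true" coefficients, and the hand-written first-step estimate `beta_hat` (which misses one true regressor and adds a wrong one) are hypothetical, and the helper names `q_hat` and `post_selection_ols` are ours. The only fact it demonstrates is the one used in the proof of Theorem 4, namely $\hat Q(\tilde\beta) - \hat Q(\beta_0) \leq B_n \wedge C_n$.

```python
import numpy as np

def q_hat(X, y, beta):
    """In-sample criterion Q_hat(beta) = E_n[(y_i - x_i' beta)^2]."""
    resid = y - X @ beta
    return float(resid @ resid) / len(y)

def post_selection_ols(X, y, support):
    """Least squares over the columns in `support`; all other coefficients are 0."""
    beta = np.zeros(X.shape[1])
    cols = np.array(sorted(support), dtype=int)
    if cols.size:
        coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        beta[cols] = coef
    return beta

rng = np.random.default_rng(1)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:3] = [2.0, -1.5, 1.0]                 # "true" model T = {0, 1, 2}
y = X @ beta0 + rng.standard_normal(n)

# Hypothetical first-step estimate: shrunk, misses index 2, adds index 7.
beta_hat = np.zeros(p)
beta_hat[[0, 1, 7]] = [1.6, -1.1, 0.2]
T_hat = set(np.flatnonzero(beta_hat).tolist())

beta_tilde = post_selection_ols(X, y, T_hat)
# beta_0 truncated to the selected model T_hat
beta0_trunc = np.where(np.isin(np.arange(p), list(T_hat)), beta0, 0.0)

B_n = q_hat(X, y, beta_hat) - q_hat(X, y, beta0)
C_n = q_hat(X, y, beta0_trunc) - q_hat(X, y, beta0)
gap = q_hat(X, y, beta_tilde) - q_hat(X, y, beta0)
print(gap, min(B_n, C_n))
```

Because $\tilde\beta$ minimizes $\hat Q$ over all vectors supported on $\hat T$, and both $\hat\beta$ and $\beta_{0\hat T}$ are such vectors, the printed gap can never exceed the printed minimum.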

The proof of Theorem 4 relies on the sparsity-based control of the empirical error 
provided by the following lemma. 

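Lemma 4 (Sparsity-based control of empirical error). Suppose Condition M holds. (1) 
For any $\varepsilon > 0$, there is a constant $K_\varepsilon$ independent of n such that with probability at least 
$1 - \varepsilon$, 

$$\left|\hat Q(\beta_0 + \delta) - \hat Q(\beta_0) - \|\delta\|_{2,n}^2\right| \leq K_\varepsilon \sigma \sqrt{\frac{m \log p + (m + s)\log(e\mu(m))}{n}}\,\|\delta\|_{2,n} + 2c_s\|\delta\|_{2,n},$$

uniformly for all $\delta \in \mathbb{R}^p$ such that $\|\delta_{T^c}\|_0 \leq m$, and uniformly over $m \leq n$. 

(2) Furthermore, with at least the same probability, 

$$\left|\hat Q(\beta_{0\tilde T}) - \hat Q(\beta_0) - \|\beta_{0\tilde T^c}\|_{2,n}^2\right| \leq K_\varepsilon \sigma \sqrt{\frac{\log\binom{s}{k} + k\log(e\mu(0))}{n}}\,\|\beta_{0\tilde T^c}\|_{2,n} + 2c_s\|\beta_{0\tilde T^c}\|_{2,n},$$

uniformly for all $\tilde T \subset T$ such that $|T \setminus \tilde T| = k$, and uniformly over $k \leq s$. 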

The proof of the lemma in turn relies on the following maximal inequality, whose proof 
involves the use of Samorodnitsky-Talagrand's type inequality. 

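Lemma 5 (Maximal inequality for a collection of empirical processes). Let $\epsilon_i \sim N(0, \sigma^2)$ 
be independent for $i = 1, \ldots, n$, and for $m = 1, \ldots, n$ define 

$$e_n(m, \eta) := \sigma\,2\sqrt{2}\left(\sqrt{\log\binom{p}{m}} + \sqrt{(m+s)\log(D\mu(m))} + \sqrt{(m+s)\log(1/\eta)}\right)$$

for any $\eta \in (0, 1)$ and some universal constant $D$. Then 

$$\sup_{\|\delta_{T^c}\|_0 \leq m,\ \|\delta\|_{2,n} > 0} \left|\mathbb{G}_n\!\left(\frac{\epsilon_i x_i'\delta}{\|\delta\|_{2,n}}\right)\right| \leq e_n(m, \eta), \quad \text{for all } m \leq n,$$

with probability at least $1 - \eta e^{-s}/(1 - 1/e)$. 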



5. Performance of least squares after lasso-based 
model selection 

In this section we specialize our results on post-model selection estimators to the case of 
lasso being the first-step estimator. The previous generic results allow us to use sparsity 
bounds and rate of convergence of lasso to derive the rate of convergence of post-model 
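selection estimators in the parametric and nonparametric models. 

5.1. Performance of ols post lasso 

Here we show that the ols post lasso estimator enjoys good theoretical performance 
despite (generally) imperfect selection of the model by lasso. 

Theorem 5 (Performance of ols post lasso). Suppose Conditions M, RE($\bar c$), and 
RSE($\hat m$) hold where $\bar c = (c+1)/(c-1)$ and $\hat m = |\hat T \setminus T|$. If $\lambda \geq c\,n\|S\|_\infty$ occurs with 
probability at least $1 - \alpha$, then for any $\varepsilon > 0$ there is a constant $K_\varepsilon$ independent of n 
such that with probability at least $1 - \alpha - \varepsilon$, for $\tilde f_i = x_i'\tilde\beta$ we have 

$$\|\tilde f - f\|_{P_n,2} \leq K_\varepsilon \sigma \sqrt{\frac{\hat m \log p + (\hat m + s)\log(e\mu(\hat m))}{n}} + 3c_s + 1\{T \not\subseteq \hat T\}\sqrt{\frac{\lambda\sqrt{s}}{n\kappa(1)}\left(\frac{(1+c)\lambda\sqrt{s}}{c\,n\,\kappa(1)} + 2c_s\right)}.$$

In particular, under Condition V and the data-driven choice of $\lambda$ specified in (2.12) with 
$\log(1/\alpha) \lesssim \log p$, $u/\ell \lesssim 1$, for any $\varepsilon > 0$ there is a constant $K'_{\varepsilon,\alpha}$ such that 

$$\|\tilde f - f\|_{P_n,2} \leq 3c_s + K'_{\varepsilon,\alpha}\,\sigma\left[\sqrt{\frac{\hat m \log(p\,e\,\mu(\hat m))}{n}} + \sqrt{\frac{s\log(e\mu(\hat m))}{n}}\right] + 1\{T \not\subseteq \hat T\}\left[K'_{\varepsilon,\alpha}\,\sigma\sqrt{\frac{s \log p}{n}}\,\frac{1}{\kappa(1)} + c_s\right] \qquad (5.21)$$

with probability at least $1 - \alpha - \varepsilon - \tau$. 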

This theorem provides a performance bound for ols post lasso as a function of 1) lasso's 
sparsity characterized by $\hat m$, 2) lasso's rate of convergence, and 3) lasso's model selection 
ability. For common designs this bound implies that ols post lasso performs at least as 
well as lasso, but it can be strictly better in some cases, and has smaller regularization 
bias. We provide further theoretical comparisons in what follows, and computational 
examples supporting these comparisons appear in the Supplementary Appendix. It is also 
worth repeating here that performance bounds in other norms of interest immediately 
follow by the triangle inequality and by definition of $\kappa$ as discussed in Remark 3.1. 
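A minimal sketch of the two-step procedure analyzed in Theorem 5 follows. It is not the authors' implementation: the lasso step is a plain cyclic coordinate descent written for the criterion $\hat Q(\beta) + (\lambda/n)\|\beta\|_1$, the simulated design is hypothetical, and the penalty level is an illustrative plug-in value rather than the data-driven rule (2.12). The helper names `q_hat` and `lasso_cd` are ours.

```python
import numpy as np

def q_hat(X, y, beta):
    """In-sample criterion Q_hat(beta) = E_n[(y_i - x_i' beta)^2]."""
    resid = y - X @ beta
    return float(resid @ resid) / len(y)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for  min_b  Q_hat(b) + (lam / n) * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.astype(float).copy()
    col_ms = (X ** 2).mean(axis=0)               # E_n[x_j^2]
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * beta[j]           # partial residual without x_j
            rho = float(X[:, j] @ resid) / n     # E_n[x_j * partial residual]
            # exact one-dimensional minimizer: soft-thresholding at lam / (2n)
            beta[j] = np.sign(rho) * max(abs(rho) - lam / (2 * n), 0.0) / col_ms[j]
            resid -= X[:, j] * beta[j]
    return beta

rng = np.random.default_rng(2)
n, p, sigma = 100, 50, 1.0
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
y = X @ beta0 + sigma * rng.standard_normal(n)

# Illustrative penalty level (an assumption, not the rule (2.12)).
lam = 2 * 1.1 * sigma * np.sqrt(2 * n * np.log(2 * p / 0.05))
beta_lasso = lasso_cd(X, y, lam)

# Second step: ols on the support selected by lasso.
T_hat = np.flatnonzero(beta_lasso)
beta_post = np.zeros(p)
if T_hat.size:
    beta_post[T_hat] = np.linalg.lstsq(X[:, T_hat], y, rcond=None)[0]

err_lasso = float(np.sqrt(np.mean((X @ (beta_lasso - beta0)) ** 2)))
err_post = float(np.sqrt(np.mean((X @ (beta_post - beta0)) ** 2)))
print(T_hat, err_lasso, err_post)
```

The refit always attains an in-sample criterion no larger than lasso's, since it minimizes $\hat Q$ over the same support. Whether the prediction-norm error also improves depends on the design and on how well lasso selects, which is what the theorem quantifies.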

The following corollary summarizes the performance of ols post lasso under commonly 
used designs. 

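Corollary 2 (Asymptotic performance of ols post lasso). Under the conditions of 
Theorem 5, (2.5) and (3.17), as n grows, we have that 

$$\|\tilde f - f\|_{P_n,2} \lesssim_P \begin{cases} \sigma\sqrt{\dfrac{s \log p}{n}} + c_s, & \text{in general,} \\[2mm] \sigma\sqrt{\dfrac{o(s)\log p}{n}} + \sigma\sqrt{\dfrac{s}{n}} + c_s, & \text{if } \hat m = o_P(s) \text{ and } T \subseteq \hat T \text{ wp} \to 1, \\[2mm] \sigma\sqrt{s/n} + c_s, & \text{if } T = \hat T \text{ wp} \to 1. \end{cases}$$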



Remark 5.1 (Comparison of the performance of ols post lasso vs lasso). We now 
compare the upper bounds on the rates of convergence of lasso and ols post lasso under 
conditions of the corollary. In general, the rates coincide. Notably, this occurs despite 
the fact that lasso may in general fail to correctly select the oracle model $T$ as a subset, 
that is, $T \not\subseteq \hat T$. However, if the oracle model has well-separated coefficients and 
the approximation error does not dominate the estimation error, then the ols post 
lasso rate improves upon lasso's rate. Specifically, this occurs if condition (2.5) holds and 
$\hat m = o_P(s)$ and $T \subseteq \hat T$ wp $\to 1$, as under conditions of Theorem 2 Part 1, or in the 
case of perfect model selection, when $T = \hat T$ wp $\to 1$, as under conditions of [31]. Under 
such cases, we know from Corollary 1 that the rates found for lasso are sharp, and they 
cannot be faster than $\sigma\sqrt{s \log p/n}$. Thus the improvement in the rate of convergence of 
ols post lasso over lasso in such cases is strict. 



5.2. Performance of ols post fit-lasso 

In what follows we provide performance bounds for the ols post fit-lasso estimator $\tilde\beta$ defined in equation 
(4.20) with threshold (2.9) for the case where the first-step estimator $\hat\beta$ is lasso. We let 
$\tilde T$ denote the model selected. 



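Theorem 6 (Performance of ols post fit-lasso). Suppose Conditions M, RE($\bar c$), and 
RSE($\tilde m$) hold where $\bar c = (c+1)/(c-1)$ and $\tilde m = |\tilde T \setminus T|$. If $\lambda \geq c\,n\|S\|_\infty$ occurs with 
probability at least $1 - \alpha$, then for any $\varepsilon > 0$ there is a constant $K_\varepsilon$ independent of n 
such that with probability at least $1 - \alpha - \varepsilon$, for $\tilde f_i = x_i'\tilde\beta$ we have 

$$\|\tilde f - f\|_{P_n,2} \leq K_\varepsilon \sigma \sqrt{\frac{\tilde m \log p + (\tilde m + s)\log(e\mu(\tilde m))}{n}} + 3c_s + 1\{T \not\subseteq \tilde T\}\sqrt{\frac{\lambda\sqrt{s}}{n\kappa(1)}\left(\frac{(1+c)\lambda\sqrt{s}}{c\,n\,\kappa(1)} + 2c_s\right)}.$$

Under Condition V and the data-driven choice of $\lambda$ specified in (2.12) with $\log(1/\alpha) \lesssim \log p$, 
$u/\ell \lesssim 1$, for any $\varepsilon > 0$ there is a constant $K'_{\varepsilon,\alpha}$ such that 

$$\|\tilde f - f\|_{P_n,2} \leq 3c_s + K'_{\varepsilon,\alpha}\,\sigma\left[\sqrt{\frac{\tilde m \log(p\,e\,\mu(\tilde m))}{n}} + \sqrt{\frac{s\log(e\mu(\tilde m))}{n}}\right] + 1\{T \not\subseteq \tilde T\}\left[K'_{\varepsilon,\alpha}\,\sigma\sqrt{\frac{s \log p}{n}}\,\frac{1}{\kappa(1)} + c_s\right] \qquad (5.22)$$

with probability at least $1 - \alpha - \varepsilon - \tau$. 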



This theorem provides a performance bound for ols post fit-lasso as a function of 1) 
its sparsity characterized by $\tilde m$, 2) lasso's rate of convergence, and 3) the model selection 
ability of the thresholding scheme. Generally, this bound is as good as the bound for 
ols post lasso, since the ols post fitness-thresholded lasso thresholds as much as possible 
subject to maintaining a certain goodness-of-fit. It is also appealing that this estimator 
determines the thresholding level in a completely data-driven fashion. Moreover, by 
construction the estimated model is sparser than ols post lasso's model, which leads to an 
improved performance of ols post fitness-thresholded lasso over ols post lasso in some 
cases. We provide further theoretical comparisons below and computational examples in 
the Supplementary Appendix. 
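The idea of thresholding "as much as possible subject to maintaining a certain goodness-of-fit" can be sketched as follows. The exact threshold rule (2.9) is not restated in this section, so the rule coded here is an assumption: it picks the largest threshold $t$ for which the ols refit on $\{j : |\hat\beta_j| > t\}$ satisfies $\hat Q(\text{refit}) - \hat Q(\hat\beta) \leq \gamma$, with the hypothetical choice $\gamma = (\hat Q(\tilde\beta) - \hat Q(\hat\beta))/2$, where $\tilde\beta$ is the refit on the full support of $\hat\beta$. The first-step estimate and all helper names are also hypothetical.

```python
import numpy as np

def q_hat(X, y, beta):
    """In-sample criterion Q_hat(beta) = E_n[(y_i - x_i' beta)^2]."""
    resid = y - X @ beta
    return float(resid @ resid) / len(y)

def refit(X, y, support):
    """ols on the columns in `support`, zeros elsewhere."""
    beta = np.zeros(X.shape[1])
    if len(support):
        beta[support] = np.linalg.lstsq(X[:, support], y, rcond=None)[0]
    return beta

def fit_threshold(X, y, beta_hat, gamma):
    """Largest candidate t with Q_hat(refit_t) - Q_hat(beta_hat) <= gamma."""
    q_first = q_hat(X, y, beta_hat)
    best_t, best_beta = None, None
    # candidates in increasing order: 0 and the nonzero magnitudes of beta_hat
    candidates = np.concatenate(([0.0], np.sort(np.abs(beta_hat[beta_hat != 0]))))
    for t in candidates:
        support = np.flatnonzero(np.abs(beta_hat) > t)
        beta_t = refit(X, y, support)
        if q_hat(X, y, beta_t) - q_first <= gamma:
            best_t, best_beta = float(t), beta_t
    return best_t, best_beta

rng = np.random.default_rng(5)
n, p = 100, 15
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:3] = [2.0, -1.5, 1.0]
y = X @ beta0 + rng.standard_normal(n)

# Hypothetical first-step estimate: shrunk signal plus small spurious terms.
beta_hat = np.zeros(p)
beta_hat[[0, 1, 2, 8, 9]] = [1.6, -1.1, 0.7, 0.04, -0.06]

full_support = np.flatnonzero(beta_hat)
gamma = (q_hat(X, y, refit(X, y, full_support)) - q_hat(X, y, beta_hat)) / 2
t_fit, beta_fit = fit_threshold(X, y, beta_hat, gamma)
print(t_fit, np.flatnonzero(beta_fit))
```

With this choice of $\gamma \leq 0$ the threshold $t = 0$ is always feasible, so the search returns a model that fits at least as well as the first step while dropping as many small components as the fit constraint allows.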

The following corollary summarizes the performance of ols post fit-lasso under 
commonly used designs. 

Corollary 3 (Asymptotic performance of ols post fit-lasso). Under the conditions of 
Theorem 6, if conditions in (2.5) and (3.17) hold, as n grows, we have that the ols post 
fitness-thresholded lasso satisfies 

$$\|\tilde f - f\|_{P_n,2} \lesssim_P \begin{cases} \sigma\sqrt{\dfrac{s \log p}{n}} + c_s, & \text{in general,} \\[2mm] \sigma\sqrt{\dfrac{o(s)\log p}{n}} + \sigma\sqrt{\dfrac{s}{n}} + c_s, & \text{if } \tilde m = o_P(s) \text{ and } T \subseteq \tilde T \text{ wp} \to 1, \\[2mm] \sigma\sqrt{s/n} + c_s, & \text{if } T = \tilde T \text{ wp} \to 1. \end{cases}$$

Remark 5.2 (Comparison of the performance of ols post fit-lasso vs lasso and ols 
post lasso). Under the conditions of the corollary, the ols post fitness-thresholded lasso 
matches the near oracle rate of convergence of lasso and ols post lasso: $\sigma\sqrt{s \log p/n} + c_s$. 
If $\tilde m = o_P(s)$ and $T \subseteq \tilde T$ wp $\to 1$ and (2.5) hold, then ols post fit-lasso strictly improves 
upon lasso's rate. That is, if the oracle model has coefficients well-separated from zero 
and the approximation error is not dominant, the improvement is strict. An interesting 
question is whether ols post fit-lasso can outperform ols post lasso in terms of the rates. 
We cannot rank these estimators in terms of rates in general. However, this necessarily 
occurs when lasso does not achieve sufficient sparsity while the model selection 
works well, namely when $\tilde m = o_P(\hat m)$ and $T \subseteq \tilde T$ wp $\to 1$. Lastly, under conditions 
ensuring perfect model selection, namely the condition of Theorem 2 holding for $t = t_\gamma$, ols 
post fit-lasso achieves the oracle performance, $\sigma\sqrt{s/n} + c_s$. □ 



5.3. Performance of the ols post thresholded lasso 

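Next we consider the traditional thresholding scheme which truncates to zero all 
components below a set threshold $t$. This is arguably the most used thresholding scheme in the 
literature. To state the result, recall that $\hat\beta_{tj} = \hat\beta_j 1\{|\hat\beta_j| > t\}$, $\tilde m := |\tilde T \setminus T|$, $m_t := |\hat T \setminus \tilde T|$ 
and $\gamma_t := \|\hat\beta_t - \hat\beta\|_{2,n}$, where $\hat\beta$ is the lasso estimator. 

Theorem 7 (Performance of ols post t-lasso). Suppose Conditions M, RE($\bar c$), and 
RSE($\tilde m$) hold where $\bar c = (c+1)/(c-1)$ and $\tilde m = |\tilde T \setminus T|$. If $\lambda \geq c\,n\|S\|_\infty$ occurs with 
probability at least $1 - \alpha$, then for any $\varepsilon > 0$ there is a constant $K_\varepsilon$ independent of n 
such that with probability at least $1 - \alpha - \varepsilon$, for $\tilde f_i = x_i'\tilde\beta$ we have 

$$\|\tilde f - f\|_{P_n,2} \leq A_{\varepsilon,n} + 3c_s + 1\{T \not\subseteq \tilde T\}\left(\gamma_t + \frac{(1+c)\lambda\sqrt{s}}{c\,n\,\kappa(\bar c)} + 2c_s\right) + 1\{T \not\subseteq \tilde T\}\sqrt{\left(A_{\varepsilon,n} + 2c_s\right)\left(\gamma_t + \frac{(1+c)\lambda\sqrt{s}}{c\,n\,\kappa(\bar c)} + 2c_s\right)},$$

$$A_{\varepsilon,n} := K_\varepsilon \sigma \sqrt{\frac{\tilde m \log p + (\tilde m + s)\log(e\mu(\tilde m))}{n}},$$

where $\gamma_t \leq t\sqrt{\phi(m_t)\,m_t}$. Under Condition V and the data-driven choice of $\lambda$ specified 
in (2.12) for $\log(1/\alpha) \lesssim \log p$, $u/\ell \lesssim 1$, for any $\varepsilon > 0$ there is a constant $K'_{\varepsilon,\alpha}$ such that 
with probability at least $1 - \alpha - \varepsilon - \tau$ 

$$\|\tilde f - f\|_{P_n,2} \leq 3c_s + K'_{\varepsilon,\alpha}\,\sigma\left[\sqrt{\frac{\tilde m \log(p\,e\,\mu(\tilde m))}{n}} + \sqrt{\frac{s\log(e\mu(\tilde m))}{n}}\right] + 1\{T \not\subseteq \tilde T\}\left[\gamma_t + K'_{\varepsilon,\alpha}\,\sigma\sqrt{\frac{s \log p}{n}}\,\frac{1}{\kappa(\bar c)} + 4c_s\right].$$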

This theorem provides a performance bound for ols post thresholded lasso as a 
function of 1) its sparsity characterized by $\tilde m$ and improvements in sparsity over lasso 
characterized by $m_t$, 2) lasso's rate of convergence, 3) the thresholding level $t$ and resulting 
goodness-of-fit loss $\gamma_t$ relative to lasso induced by thresholding, and 4) model selection 
ability of the thresholding scheme. Generally, this bound may be worse than the bound 
for lasso, and this arises because the ols post thresholded lasso may potentially use too 
much thresholding resulting in large goodness-of-fit losses $\gamma_t$. We provide further 
theoretical comparisons below and computational examples in Section D of the Supplementary 
Appendix. 
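The quantities $\hat\beta_t$, $m_t$ and $\gamma_t$, and the bound $\gamma_t \leq t\sqrt{\phi(m_t)\,m_t}$, can be illustrated numerically. The sketch below assumes a hypothetical first-step estimate and a simulated normalized design. In place of $\phi(m_t)$ it uses the largest eigenvalue of the empirical Gram block on the dropped coordinates, which is no larger than $\phi(m_t)$ when those coordinates lie outside $T$, so the check is slightly stronger than the stated bound. The helper names are ours.

```python
import numpy as np

def hard_threshold(beta_hat, t):
    """beta_hat_t with components beta_hat_j * 1{|beta_hat_j| > t}."""
    return np.where(np.abs(beta_hat) > t, beta_hat, 0.0)

def prediction_norm(X, delta):
    """||delta||_{2,n} = sqrt(E_n[(x_i' delta)^2])."""
    return float(np.sqrt(np.mean((X @ delta) ** 2)))

rng = np.random.default_rng(3)
n, p = 80, 30
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).mean(axis=0))          # normalize so that E_n[x_j^2] = 1

# Hypothetical first-step estimate with a few small spurious components.
beta_hat = np.zeros(p)
beta_hat[[0, 1, 2]] = [1.8, -1.2, 0.9]
beta_hat[[10, 11, 12, 13]] = [0.05, -0.08, 0.03, 0.1]

t = 0.1
beta_t = hard_threshold(beta_hat, t)
dropped = np.flatnonzero((beta_hat != 0) & (beta_t == 0))   # T_hat \ T_tilde
m_t = dropped.size
gamma_t = prediction_norm(X, beta_t - beta_hat)

# Largest eigenvalue of the Gram block on the dropped coordinates.
gram_dd = X[:, dropped].T @ X[:, dropped] / n
eig_max = float(np.linalg.eigvalsh(gram_dd)[-1])
bound = t * np.sqrt(eig_max * m_t)
print(m_t, gamma_t, bound)
```

Each dropped coefficient has magnitude at most $t$, so the Euclidean norm of $\hat\beta_t - \hat\beta$ is at most $t\sqrt{m_t}$, and the eigenvalue converts this into the prediction norm.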

Remark 5.3 (Comparison of the performance of ols post thresholded lasso vs lasso and 
ols post lasso). In this discussion we also assume conditions in (2.5) and (3.17) made 
in the previous formal comparisons. Under these conditions, ols post thresholded lasso 
obeys the bound: 

$$\|\tilde f - f\|_{P_n,2} \lesssim_P \sigma\sqrt{\frac{\tilde m \log p}{n}} + \sigma\sqrt{\frac{s}{n}} + c_s + 1\{T \not\subseteq \tilde T\}\left(\gamma_t \vee \sigma\sqrt{\frac{s \log p}{n}}\right). \qquad (5.23)$$

In this case we have $\tilde m \vee m_t \leq s + \hat m \lesssim_P s$ by Theorem 3, and, in general, the rate above 
cannot improve upon lasso's rate of convergence given in Lemma 1. 

As expected, the choice of $t$, which controls $\gamma_t$ via the bound $\gamma_t \leq t\sqrt{\phi(m_t)\,m_t}$, can 
have a large impact on the performance bounds: 

$$\text{If } t \lesssim \sigma\sqrt{\frac{\log p}{n}} \text{ then } \|\tilde f - f\|_{P_n,2} \lesssim_P \sigma\sqrt{\frac{s \log p}{n}} + c_s. \qquad (5.24)$$

The choice (5.24), suggested by [19] and Theorem 3, is theoretically sound, since it 
guarantees that ols post thresholded lasso achieves the near-oracle rate of lasso. Note 
that to implement the choice (5.24) in practice we suggest setting $t = \lambda/n$, since the 
separation from zero of the coefficients is unknown in practice. Note that using a much 
larger $t$ can lead to inferior rates of convergence. 

Furthermore, there is a special class of models, a neighborhood of parametric models 
with well-separated coefficients, for which improvements upon the rate of convergence of 
lasso are possible. Specifically, if $\tilde m = o_P(s)$ and $T \subseteq \tilde T$ wp $\to 1$ then ols post thresholded 
lasso strictly improves upon lasso's rate. Furthermore, if $\tilde m = o_P(\hat m)$ and $T \subseteq \tilde T$ wp $\to 1$, 
ols post thresholded lasso also outperforms ols post lasso: 

$$\|\tilde f - f\|_{P_n,2} \lesssim_P \sigma\sqrt{\frac{o(\hat m)\log p}{n}} + \sigma\sqrt{\frac{s}{n}} + c_s.$$

Lastly, under the conditions of Theorem 2 holding for the given $t$, ols post thresholded 
lasso achieves the oracle performance, $\|\tilde f - f\|_{P_n,2} \lesssim_P \sigma\sqrt{s/n} + c_s$. □ 

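Appendix A: Proofs 
A.1. Proofs for Section 3 

Proof of Theorem 1. The bound in the $\|\cdot\|_{2,n}$ norm follows by the same steps as in [4], 
so we defer the derivation to the supplement. 

Under the data-driven choice (2.12) of $\lambda$ and Condition V, we have $c'\hat\sigma \geq c\sigma$ with 
probability at least $1 - \tau$ since $c' \geq c/\ell$. Moreover, with the same probability we also 
have $\lambda \leq c' u \sigma \Lambda(1 - \alpha|X)$. The result follows by invoking the $\|\cdot\|_{2,n}$ bound. 

The bound in $\|\cdot\|_1$ is proven as follows. First, assume $\|\delta_{T^c}\|_1 \leq 2\bar c\|\delta_T\|_1$. In this 
case, by definition of the restricted eigenvalue, we have $\|\delta\|_1 \leq (1 + 2\bar c)\|\delta_T\|_1 \leq (1 + 2\bar c)\sqrt{s}\,\|\delta\|_{2,n}/\kappa(2\bar c)$ 
and the result follows by applying the first bound to $\|\delta\|_{2,n}$, since 
$\bar c > 1$. On the other hand, consider the case that $\|\delta_{T^c}\|_1 > 2\bar c\|\delta_T\|_1$. The relation 

$$-\frac{\lambda}{cn}\left(\|\delta_T\|_1 + \|\delta_{T^c}\|_1\right) + \|\delta\|_{2,n}^2 - 2c_s\|\delta\|_{2,n} \leq \frac{\lambda}{n}\left(\|\delta_T\|_1 - \|\delta_{T^c}\|_1\right),$$

which is established in (B.35) in the supplementary appendix, implies that $\|\delta\|_{2,n} \leq 2c_s$ 
and also 

$$\|\delta_{T^c}\|_1 \leq \bar c\|\delta_T\|_1 + \frac{c}{c-1}\frac{n}{\lambda}\|\delta\|_{2,n}\left(2c_s - \|\delta\|_{2,n}\right) \leq \bar c\|\delta_T\|_1 + \frac{c}{c-1}\frac{n}{\lambda}c_s^2 \leq \frac{1}{2}\|\delta_{T^c}\|_1 + \frac{c}{c-1}\frac{n}{\lambda}c_s^2.$$

Thus, 

$$\|\delta\|_1 \leq \left(1 + \frac{1}{2\bar c}\right)\|\delta_{T^c}\|_1 \leq \left(1 + \frac{1}{2\bar c}\right)\frac{2c}{c-1}\frac{n}{\lambda}c_s^2.$$

The result follows by taking the maximum of the bounds on each case and invoking the 
bound on $\|\delta\|_{2,n}$. □ 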

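Proof of Theorem 2. Part (1) follows immediately from the assumptions. 
To show part (2), let $\delta := \hat\beta - \beta_0$, and proceed in two steps. 
Step 1. By the first order optimality conditions of $\hat\beta$ and the assumption on $\lambda$, 

$$\|E_n[x_\bullet x_\bullet'\delta]\|_\infty \leq \|E_n[x_\bullet(y_\bullet - x_\bullet'\hat\beta)]\|_\infty + \|S/2\|_\infty + \|E_n[x_\bullet r_\bullet]\|_\infty \leq \frac{\lambda}{2n} + \frac{\lambda}{2cn} + \min\left\{\frac{\sigma}{\sqrt{n}}, c_s\right\},$$

since $\|E_n[x_\bullet r_\bullet]\|_\infty \leq \min\{\sigma/\sqrt{n}, c_s\}$ by Step 2 below. 

Next let $e_j$ denote the jth-canonical direction. Thus, for every $j = 1, \ldots, p$ we have 

$$|E_n[e_j' x_\bullet x_\bullet'\delta] - \delta_j| = |E_n[e_j'(x_\bullet x_\bullet' - I)\delta]| \leq \max_{1 \leq j,k \leq p}|(E_n[x_\bullet x_\bullet' - I])_{jk}|\,\|\delta\|_1 \leq \|\delta\|_1/[Us].$$

Then, combining the two bounds above and using the triangle inequality we have 

$$\|\delta\|_\infty \leq \|E_n[x_\bullet x_\bullet'\delta]\|_\infty + \|E_n[x_\bullet x_\bullet'\delta] - \delta\|_\infty \leq \left(1 + \frac{1}{c}\right)\frac{\lambda}{2n} + \min\left\{\frac{\sigma}{\sqrt{n}}, c_s\right\} + \frac{\|\delta\|_1}{Us}.$$

The result follows by Theorem 1 to bound $\|\delta\|_1$ and the arguments in [4] and [19] to show 
that the bound on the correlations implies that for any $C > 0$ 

$$\kappa(C) \geq \sqrt{1 - s(1 + 2C)\|E_n[x_\bullet x_\bullet' - I]\|_\infty},$$

so that $\kappa(\bar c) \geq \sqrt{1 - [(1 + 2\bar c)/U]}$ and $\kappa(2\bar c) \geq \sqrt{1 - [(1 + 4\bar c)/U]}$ under this particular 
design. 

Step 2. In this step we show that $\|E_n[x_\bullet r_\bullet]\|_\infty \leq \min\{\sigma/\sqrt{n}, c_s\}$. First note that for 
every $j = 1, \ldots, p$, we have $|E_n[x_{\bullet j} r_\bullet]| \leq \sqrt{E_n[x_{\bullet j}^2]\,E_n[r_\bullet^2]} = c_s$. Next, by definition of $\beta_0$ 
in (2.2), for $j \in T$ we have $E_n[x_{\bullet j}(f_\bullet - x_\bullet'\beta_0)] = E_n[x_{\bullet j} r_\bullet] = 0$ since $\beta_0$ is a minimizer 
over the support of $\beta_0$. For $j \in T^c$ we have that for any $t \in \mathbb{R}$ 

$$E_n[(f_\bullet - x_\bullet'\beta_0)^2] + \sigma^2\frac{s}{n} \leq E_n[(f_\bullet - x_\bullet'\beta_0 - t x_{\bullet j})^2] + \sigma^2\frac{s+1}{n}.$$

Therefore, for any $t \in \mathbb{R}$ we have 

$$-\sigma^2/n \leq E_n[(f_\bullet - x_\bullet'\beta_0 - t x_{\bullet j})^2] - E_n[(f_\bullet - x_\bullet'\beta_0)^2] = -2tE_n[x_{\bullet j}(f_\bullet - x_\bullet'\beta_0)] + t^2 E_n[x_{\bullet j}^2].$$

Taking the minimum over $t$ in the right hand side at $t^* = E_n[x_{\bullet j}(f_\bullet - x_\bullet'\beta_0)]$ we obtain 
$-\sigma^2/n \leq -(E_n[x_{\bullet j}(f_\bullet - x_\bullet'\beta_0)])^2$ or equivalently, $|E_n[x_{\bullet j}(f_\bullet - x_\bullet'\beta_0)]| \leq \sigma/\sqrt{n}$. □ 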

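Proof of Lemma 2. Let $\hat T = \mathrm{support}(\hat\beta)$, and $\hat m = |\hat T \setminus T|$. We have from the optimality 
conditions that $|2E_n[x_{\bullet j}(y_\bullet - x_\bullet'\hat\beta)]| = \lambda/n$ for all $j \in \hat T$. Therefore we have for $R = (r_1, \ldots, r_n)'$ 

$$\sqrt{|\hat T|}\,\lambda = 2\|(X'(Y - X\hat\beta))_{\hat T}\| \leq 2\|(X'(Y - R - X\beta_0))_{\hat T}\| + 2\|(X'(R + X\beta_0 - X\hat\beta))_{\hat T}\| \leq \sqrt{|\hat T|}\cdot n\|S\|_\infty + 2n\sqrt{\phi(\hat m)}\left(E_n[(x_\bullet'\hat\beta - f_\bullet)^2]\right)^{1/2},$$

where we used the definition of $\phi(\hat m)$ and the Hölder inequality. Since $\lambda/c \geq n\|S\|_\infty$ we 
have 

$$(1 - 1/c)\sqrt{|\hat T|}\,\lambda \leq 2n\sqrt{\phi(\hat m)}\left(E_n[(x_\bullet'\hat\beta - f_\bullet)^2]\right)^{1/2}. \qquad (A.25)$$

Moreover, since $\hat m \leq |\hat T|$, and by Theorem 1 and Remark 3.1, $(E_n[(x_\bullet'\hat\beta - f_\bullet)^2])^{1/2} \leq \|\hat\beta - \beta_0\|_{2,n} + c_s \leq (1 + 1/c)\frac{\lambda\sqrt{s}}{n\kappa(\bar c)} + 3c_s$, we have 

$$(1 - 1/c)\sqrt{\hat m} \leq 2\sqrt{\phi(\hat m)}\,(1 + 1/c)\sqrt{s}/\kappa(\bar c) + 6\sqrt{\phi(\hat m)}\,n c_s/\lambda.$$

The result follows by noting that $(1 - 1/c) = 2/(\bar c + 1)$ by definition of $\bar c$. □ 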
Proof of Theorem 3. In the event $\lambda \geq c \cdot n\|S\|_\infty$, by Lemma 2, $\sqrt{\hat m} \leq \sqrt{\phi(\hat m)} \cdot 2\bar c\sqrt{s}/\kappa(\bar c) + 3(\bar c + 1)\sqrt{\phi(\hat m)} \cdot n c_s/\lambda$, 
which, by letting $L_n = \left(\frac{2\bar c}{\kappa(\bar c)} + 3(\bar c + 1)\frac{n c_s}{\lambda\sqrt{s}}\right)^2$, can 
be rewritten as 

$$\hat m \leq s \cdot \phi(\hat m)\,L_n. \qquad (A.26)$$

Note that $\hat m \leq n$ by optimality conditions. Consider any $M \in \mathcal{M}$, and suppose $\hat m > M$. 
Therefore by Lemma 3 on sublinearity of sparse eigenvalues 

$$\hat m \leq s \cdot \left\lceil \frac{\hat m}{M} \right\rceil \phi(M)\,L_n.$$

Thus, since $\lceil k \rceil < 2k$ for any $k \geq 1$ we have $M < s \cdot 2\phi(M)L_n$, which violates the condition 
$M \in \mathcal{M}$. Therefore, we must have $\hat m \leq M$. In turn, applying (A.26) once more with 
$\hat m \leq (M \wedge n)$ we obtain $\hat m \leq s \cdot \phi(M \wedge n)L_n$. The result follows by minimizing the bound 
over $M \in \mathcal{M}$. □ 



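A.2. Proofs for Section 4 

Proof of Theorem 4. Let $\tilde\delta := \tilde\beta - \beta_0$. By definition of the second-step estimator, it 
follows that $\hat Q(\tilde\beta) \leq \hat Q(\hat\beta)$ and $\hat Q(\tilde\beta) \leq \hat Q(\beta_{0\hat T})$. Thus, 

$$\hat Q(\tilde\beta) - \hat Q(\beta_0) \leq \left(\hat Q(\hat\beta) - \hat Q(\beta_0)\right) \wedge \left(\hat Q(\beta_{0\hat T}) - \hat Q(\beta_0)\right) = B_n \wedge C_n.$$

By Lemma 4 part (1), for any $\varepsilon > 0$ there exists a constant $K_\varepsilon$ such that with probability 
at least $1 - \varepsilon$: $|\hat Q(\tilde\beta) - \hat Q(\beta_0) - \|\tilde\delta\|_{2,n}^2| \leq A_{\varepsilon,n}\|\tilde\delta\|_{2,n} + 2c_s\|\tilde\delta\|_{2,n}$ where 

$$A_{\varepsilon,n} := K_\varepsilon \sigma\sqrt{(\hat m \log p + (\hat m + s)\log(e\mu(\hat m)))/n}.$$

Combining these relations we obtain the inequality $\|\tilde\delta\|_{2,n}^2 - A_{\varepsilon,n}\|\tilde\delta\|_{2,n} - 2c_s\|\tilde\delta\|_{2,n} \leq B_n \wedge C_n$, 
solving which we obtain the stated inequality: $\|\tilde\delta\|_{2,n} \leq A_{\varepsilon,n} + 2c_s + \sqrt{(B_n)_+ \wedge (C_n)_+}$. 
Finally, the bound on $B_n$ follows from Lemma 4 result (1). The bound on $C_n$ follows 
from Lemma 4 result (2). □ 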
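The step "solving which we obtain the stated inequality" can be spelled out. The following is our own sketch of that elementary argument, with the shorthand $x$, $a$, $b$ introduced only for this purpose; the last line uses $f - x'\beta_0 = r$ with $(E_n[r^2])^{1/2} = c_s$, as in Step 2 of the proof of Theorem 2.

```latex
% Shorthand (ours): x = \|\tilde\delta\|_{2,n} \ge 0, \quad a = A_{\varepsilon,n} + 2c_s \ge 0, \quad b = B_n \wedge C_n.
\begin{align*}
x^2 - a x \le b
  &\;\Longrightarrow\; x \le a \quad\text{or}\quad (x - a)^2 \le x (x - a) \le b_+ \\
  &\;\Longrightarrow\; x \le a + \sqrt{b_+}, \\
(B_n \wedge C_n)_+ &= (B_n)_+ \wedge (C_n)_+ , \\
\|\tilde f - f\|_{P_n,2}
  &\le \|\tilde\delta\|_{2,n} + c_s
   \le A_{\varepsilon,n} + 3 c_s + \sqrt{(B_n)_+ \wedge (C_n)_+}.
\end{align*}
```

The second case uses $x > a \geq 0$, so that $x - a \leq x$ and both factors are positive.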

Proof of Lemma 4. Part (1) follows from the relation 

$$|\hat Q(\beta_0 + \delta) - \hat Q(\beta_0) - \|\delta\|_{2,n}^2| = |2E_n[\epsilon_\bullet x_\bullet'\delta] + 2E_n[r_\bullet x_\bullet'\delta]|,$$

then bounding $|2E_n[r_\bullet x_\bullet'\delta]|$ by $2c_s\|\delta\|_{2,n}$ using the Cauchy-Schwarz inequality, applying 
Lemma 5 on sparse control of noise to $|2E_n[\epsilon_\bullet x_\bullet'\delta]|$ where we bound $\binom{p}{m}$ by $p^m$ and set 
$K_\varepsilon = 6\sqrt{2}\,\log^{1/2}\max\{e, D, 1/(\varepsilon e^s[1 - 1/e])\}$. Part (2) also follows from Lemma 5 but 
applying it with $s = 0$, $p = s$ (since only the components in $T$ are modified), $m = k$, and 
noting that we can take $\mu(m)$ with $m = 0$. □ 
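Proof of Lemma 5. We divide the proof into steps. 

Step 0. Note that we can restrict the supremum over $\|\delta\| = 1$ since the function is 
homogeneous of degree zero. 

Step 1. For each non-negative integer $m \leq n$, and each set $\tilde T \subset \{1, \ldots, p\}$, with 
$|\tilde T \setminus T| \leq m$, define the class of functions 

$$\mathcal{G}_{\tilde T} = \{\epsilon_i x_i'\delta/\|\delta\|_{2,n} : \mathrm{support}(\delta) \subseteq \tilde T, \|\delta\| = 1\}. \qquad (A.27)$$

Also define $\mathcal{F}_m = \{\mathcal{G}_{\tilde T} : \tilde T \subset \{1, \ldots, p\} : |\tilde T \setminus T| \leq m\}$. It follows that 

$$P\Big(\sup_{f \in \mathcal{F}_m}|\mathbb{G}_n(f)| \geq e_n(m, \eta)\Big) \leq \binom{p}{m}\max_{|\tilde T \setminus T| \leq m} P\Big(\sup_{f \in \mathcal{G}_{\tilde T}}|\mathbb{G}_n(f)| \geq e_n(m, \eta)\Big). \qquad (A.28)$$

We apply Samorodnitsky-Talagrand's inequality (Proposition A.2.7 in van der Vaart 
and Wellner [30]) to bound the right hand side of (A.28). Let 

$$\rho(f, g) := \sqrt{E[\mathbb{G}_n(f) - \mathbb{G}_n(g)]^2} = \sqrt{E\,E_n[(f - g)^2]}$$

for $f, g \in \mathcal{G}_{\tilde T}$; by Step 2 below, the covering number of $\mathcal{G}_{\tilde T}$ with respect to $\rho$ obeys 

$$N(\varepsilon, \mathcal{G}_{\tilde T}, \rho) \leq (6\sigma\mu(m)/\varepsilon)^{m+s}, \quad \text{for each } 0 < \varepsilon \leq \sigma, \qquad (A.29)$$

and $\sigma^2(\mathcal{G}_{\tilde T}) := \max_{f \in \mathcal{G}_{\tilde T}} E[\mathbb{G}_n(f)]^2 = \sigma^2$. Then, by Samorodnitsky-Talagrand's 
inequality 

$$P\Big(\sup_{f \in \mathcal{G}_{\tilde T}}|\mathbb{G}_n(f)| \geq e_n(m, \eta)\Big) \leq \left(\frac{D\sigma\mu(m)\,e_n(m, \eta)}{\sqrt{m+s}\,\sigma^2}\right)^{m+s}\bar\Phi(e_n(m, \eta)/\sigma) \qquad (A.30)$$

for some universal constant $D \geq 1$, where $\bar\Phi = 1 - \Phi$ and $\Phi$ is the cumulative 
probability distribution function for a standardized Gaussian random variable. For $e_n(m, \eta)$ 
defined in the statement of the lemma, it follows that $P\big(\sup_{f \in \mathcal{G}_{\tilde T}}|\mathbb{G}_n(f)| \geq e_n(m, \eta)\big) \leq \eta e^{-m-s}/\binom{p}{m}$ 
by simple substitution into (A.30). Then, 

$$P\Big(\sup_{f \in \mathcal{F}_m}|\mathbb{G}_n(f)| > e_n(m, \eta), \exists m \leq n\Big) \leq \sum_{m=0}^{n} P\Big(\sup_{f \in \mathcal{F}_m}|\mathbb{G}_n(f)| > e_n(m, \eta)\Big) \leq \sum_{m=0}^{n}\eta e^{-m-s} \leq \eta e^{-s}/(1 - 1/e),$$

which proves the claim. 

Step 2. This step establishes (A.29). For $t \in \mathbb{R}^p$ and $\tilde t \in \mathbb{R}^p$, consider any two functions 
$\epsilon_i\frac{x_i't}{\|t\|_{2,n}}$ and $\epsilon_i\frac{x_i'\tilde t}{\|\tilde t\|_{2,n}}$ in $\mathcal{G}_{\tilde T}$, for a given $\tilde T \subset \{1, \ldots, p\} : |\tilde T \setminus T| \leq m$. 
We have that 

$$\sqrt{E\,E_n\left[\epsilon_\bullet^2\left(\frac{x_\bullet't}{\|t\|_{2,n}} - \frac{x_\bullet'\tilde t}{\|\tilde t\|_{2,n}}\right)^2\right]} \leq \sqrt{E\,E_n\left[\epsilon_\bullet^2\frac{(x_\bullet'(t - \tilde t))^2}{\|t\|_{2,n}^2}\right]} + \sqrt{E\,E_n\left[\epsilon_\bullet^2\left(\frac{x_\bullet'\tilde t}{\|t\|_{2,n}} - \frac{x_\bullet'\tilde t}{\|\tilde t\|_{2,n}}\right)^2\right]}.$$

By definition of $\mathcal{G}_{\tilde T}$ in (A.27), $\mathrm{support}(t) \subseteq \tilde T$ and $\mathrm{support}(\tilde t) \subseteq \tilde T$, so that 
$\mathrm{support}(t - \tilde t) \subseteq \tilde T$, $|\tilde T \setminus T| \leq m$, and $\|t\| = 1$ by (A.27). Hence by definition RSE($m$), 

$$E\,E_n\left[\epsilon_\bullet^2\frac{(x_\bullet'(t - \tilde t))^2}{\|t\|_{2,n}^2}\right] \leq \sigma^2\phi(m)\|t - \tilde t\|^2/\tilde\kappa(m)^2, \quad \text{and}$$

$$E\,E_n\left[\epsilon_\bullet^2\left(\frac{x_\bullet'\tilde t}{\|t\|_{2,n}} - \frac{x_\bullet'\tilde t}{\|\tilde t\|_{2,n}}\right)^2\right] = E\,E_n\left[\epsilon_\bullet^2\frac{(x_\bullet'\tilde t)^2}{\|\tilde t\|_{2,n}^2}\right]\left(\frac{\|\tilde t\|_{2,n} - \|t\|_{2,n}}{\|t\|_{2,n}}\right)^2 = \sigma^2\left(\frac{\|\tilde t\|_{2,n} - \|t\|_{2,n}}{\|t\|_{2,n}}\right)^2 \leq \sigma^2\|\tilde t - t\|_{2,n}^2/\|t\|_{2,n}^2 \leq \sigma^2\phi(m)\|\tilde t - t\|^2/\tilde\kappa(m)^2,$$

so that 

$$\sqrt{E\,E_n\left[\epsilon_\bullet^2\left(\frac{x_\bullet't}{\|t\|_{2,n}} - \frac{x_\bullet'\tilde t}{\|\tilde t\|_{2,n}}\right)^2\right]} \leq 2\sigma\|t - \tilde t\|\sqrt{\phi(m)}/\tilde\kappa(m) = 2\sigma\mu(m)\|t - \tilde t\|.$$

Then the bound (A.29) follows from the bound in [30] page 94, $N(\varepsilon, \mathcal{G}_{\tilde T}, \rho) \leq N(\varepsilon/R, B(0, 1), \|\cdot\|) \leq (3R/\varepsilon)^{m+s}$ 
with $R = 2\sigma\mu(m)$ for any $\varepsilon \leq \sigma$. □ 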



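A.3. Proofs for Section 5 

Proof of Theorem 5. First note that if $T \subseteq \hat T$ we have $C_n = 0$ so that $B_n \wedge C_n \leq 1\{T \not\subseteq \hat T\}B_n$. 

Next we bound $B_n$. Note that by the optimality of $\hat\beta$ in the lasso problem, and letting 
$\hat\delta = \hat\beta - \beta_0$, 

$$B_n := \hat Q(\hat\beta) - \hat Q(\beta_0) \leq \frac{\lambda}{n}\left(\|\beta_0\|_1 - \|\hat\beta\|_1\right) \leq \frac{\lambda}{n}\left(\|\hat\delta_T\|_1 - \|\hat\delta_{T^c}\|_1\right). \qquad (A.31)$$

If $\|\hat\delta_{T^c}\|_1 > \|\hat\delta_T\|_1$, we have $\hat Q(\hat\beta) - \hat Q(\beta_0) \leq 0$. Otherwise, if $\|\hat\delta_{T^c}\|_1 \leq \|\hat\delta_T\|_1$, by RE(1) 
we have 

$$B_n := \hat Q(\hat\beta) - \hat Q(\beta_0) \leq \frac{\lambda}{n}\|\hat\delta_T\|_1 \leq \frac{\lambda}{n}\frac{\sqrt{s}\,\|\hat\delta\|_{2,n}}{\kappa(1)}. \qquad (A.32)$$

The result follows by applying Theorem 1 to bound $\|\hat\delta\|_{2,n}$, under the condition that 
RE(1) holds, and Theorem 4. 

The second claim follows from the first by using $\lambda \lesssim \sqrt{n \log p}$ under Condition V and the 
specified conditions on the penalty level. The final bound follows by applying the relation 
that for any nonnegative numbers $a$, $b$, we have $\sqrt{ab} \leq (a + b)/2$. □ 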
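For concreteness, the last step can be written out as follows. This is our own sketch: the shorthand $a$, $b$ is introduced here, and the final line assumes that the penalty condition is read as $\lambda \leq K''\sigma\sqrt{n\log p}$ for some constant $K''$, which is how we interpret the bound on $\lambda$ under Condition V.

```latex
% Shorthand (ours): a = \frac{\lambda\sqrt{s}}{n\kappa(1)}, \qquad b = \Big(1 + \frac{1}{c}\Big) a + 2 c_s .
\begin{align*}
\sqrt{(B_n)_+}
  \;\le\; \sqrt{a\,b}
  \;\le\; \frac{a + b}{2}
  \;=\; \Big(1 + \frac{1}{2c}\Big)\frac{\lambda\sqrt{s}}{n\,\kappa(1)} + c_s
  \;\le\; \Big(1 + \frac{1}{2c}\Big) K''\,\sigma\sqrt{\frac{s\log p}{n}}\,\frac{1}{\kappa(1)} + c_s .
\end{align*}
```

The constant in front of $\sigma\sqrt{s\log p/n}$ is absorbed into $K'_{\varepsilon,\alpha}$ in the second bound of the theorem.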
Appendix A: Additional Results and Comments 
A.1. On the Oracle Problem 

Let us now briefly explain what is behind problem (2.2). Under some mild assumptions, 
this problem directly arises as the (infeasible) oracle risk minimization problem. Indeed, 
consider a least squares estimator $\hat\beta_{\tilde T}$, which is obtained by using a model $\tilde T$, i.e. by 
regressing $y_i$ on regressors $x_i[\tilde T]$, where $x_i[\tilde T] = \{x_{ij}, j \in \tilde T\}$. This estimator takes value 
$\hat\beta_{\tilde T} = E_n[x_\bullet[\tilde T]x_\bullet[\tilde T]']^{-}E_n[x_\bullet[\tilde T]y_\bullet]$. The expected risk of this estimator, $E_n E[f_\bullet - x_\bullet'\hat\beta_{\tilde T}]^2$, 
is equal to 

$$\min_{\beta \in \mathbb{R}^{|\tilde T|}} E_n[(f_\bullet - x_\bullet[\tilde T]'\beta)^2] + \sigma^2\frac{k}{n},$$

where $k = \mathrm{rank}(E_n[x_\bullet[\tilde T]x_\bullet[\tilde T]'])$. The oracle knows the risk of each of the models $\tilde T$ and 
can minimize this risk 

$$\min_{\tilde T}\min_{\beta \in \mathbb{R}^{|\tilde T|}} E_n[(f_\bullet - x_\bullet[\tilde T]'\beta)^2] + \sigma^2\frac{k}{n},$$

by choosing the best model or the oracle model $T$. This problem is in fact equivalent 
to (2.2), provided that $\mathrm{rank}(E_n[x_\bullet[T]x_\bullet[T]']) = \|\beta_0\|_0$, i.e. full rank. Thus, in this case 
any value $\beta_0$ solving (2.2) is the expected value of the oracle least squares estimator 
$\hat\beta_T = E_n[x_\bullet[T]x_\bullet[T]']^{-1}E_n[x_\bullet[T]y_\bullet]$, i.e. $\beta_0 = E_n[x_\bullet[T]x_\bullet[T]']^{-1}E_n[x_\bullet[T]f_\bullet]$. This value 
is our target or "true" parameter value and the oracle model $T$ is the target or "true" 
model. Note that when $c_s = 0$ we have that $f_i = x_i'\beta_0$, which gives us the special 
parametric case. 
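The oracle risk minimization described above can be carried out by brute force when $p$ is tiny. The sketch below is illustrative only: it assumes a simulated design and a parametric regression function $f = X\beta$ with well-separated coefficients, enumerates every model, and evaluates the risk $\min_\beta E_n[(f - x[\tilde T]'\beta)^2] + \sigma^2 k/n$. The helper name `oracle_risk` is ours, and the search is infeasible in practice since $f$ and $\sigma$ are unknown.

```python
import itertools
import numpy as np

def oracle_risk(X, f, sigma, support):
    """min_b E_n[(f - x[T]'b)^2] + sigma^2 * rank / n for the model `support`."""
    n = X.shape[0]
    if len(support) == 0:
        return float(np.mean(f ** 2))
    XT = X[:, list(support)]
    coef, _, rank, _ = np.linalg.lstsq(XT, f, rcond=None)
    approx = float(np.mean((f - XT @ coef) ** 2))   # squared approximation error
    return approx + sigma ** 2 * rank / n

rng = np.random.default_rng(4)
n, p, sigma = 50, 6, 1.0
X = rng.standard_normal((n, p))
beta = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 1.0])
f = X @ beta                                        # parametric case: c_s = 0

models = [T for k in range(p + 1) for T in itertools.combinations(range(p), k)]
risks = {T: oracle_risk(X, f, sigma, T) for T in models}
T_oracle = min(risks, key=risks.get)
print(T_oracle, risks[T_oracle])
```

In this parametric example the oracle model coincides with the support of `beta`: dropping a component costs far more in approximation error than the $\sigma^2/n$ saved, and adding one costs $\sigma^2/n$ with no gain.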

A.2. Estimation of $\sigma$: finite-sample analysis 

Consider the following algorithm to estimate $\sigma$. 

Algorithm (Estimation of $\sigma$ using lasso iterations). Set $\hat\sigma_0 = \sqrt{\mathrm{Var}_n[y_\bullet]}$. 

(1) Compute the lasso estimator $\hat\beta$ based on $\lambda = c'\hat\sigma_0\Lambda(1 - \alpha|X)$; 

(2) Set $\hat\sigma = \sqrt{\hat Q(\hat\beta)}$. 

The following lemmas establish the finite sample bounds on $\ell$, $u$, and $\tau$ that appear 
in Condition V associated with using $\hat\sigma_0$ and $\sqrt{\hat Q(\hat\beta)}$ as an estimator for $\sigma$. 
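The algorithm above can be sketched in a few lines. Several ingredients are assumptions on our part: $\Lambda(1 - \alpha|X)$ is taken to be the $(1-\alpha)$-quantile of $n\|S/\sigma\|_\infty$ given $X$ with $S = 2E_n[x_\bullet\epsilon_\bullet]$ and is approximated by simulation, the values $c' = 1.1$ and $\alpha = 0.05$ are illustrative, the lasso solver is a plain coordinate descent for $\hat Q(\beta) + (\lambda/n)\|\beta\|_1$, and all function names are ours.

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for  min_b  Q_hat(b) + (lam / n) * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.astype(float).copy()
    col_ms = (X ** 2).mean(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * beta[j]
            rho = float(X[:, j] @ resid) / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam / (2 * n), 0.0) / col_ms[j]
            resid -= X[:, j] * beta[j]
    return beta

def penalty_quantile(X, alpha, n_sim=500, seed=0):
    """Simulated (1 - alpha)-quantile of n * ||S / sigma||_inf given X."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n_sim, X.shape[0]))    # draws of eps / sigma
    stats = 2.0 * np.abs(g @ X).max(axis=1)         # n * ||2 E_n[x_i g_i]||_inf
    return float(np.quantile(stats, 1.0 - alpha))

def estimate_sigma(X, y, c_prime=1.1, alpha=0.05):
    sigma0 = float(np.sqrt(np.var(y)))              # sqrt(Var_n[y])
    lam = c_prime * sigma0 * penalty_quantile(X, alpha)
    beta_hat = lasso_cd(X, y, lam)                  # step (1)
    sigma_hat = float(np.sqrt(np.mean((y - X @ beta_hat) ** 2)))  # step (2)
    return sigma0, sigma_hat, beta_hat

rng = np.random.default_rng(6)
n, p = 100, 40
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:3] = [2.0, -1.5, 1.0]
y = X @ beta0 + rng.standard_normal(n)              # true sigma = 1

sigma0, sigma_hat, beta_hat = estimate_sigma(X, y)
print(sigma0, sigma_hat)
```

The initial value $\hat\sigma_0$ is conservative because it includes the variation of the signal, which is why the resulting penalty level is at least as large as needed; the second step then replaces it with a residual-based estimate.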
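Lemma 6. Assume that for some $k > 4$ we have $E[|y_i|^k] \leq C$ uniformly in n. There 
is a constant $K$ such that for any positive numbers $v$ and $r$ we have with probability at 
least $1 - \frac{KC}{n^{k/4}v^{k/2}} - \frac{KC}{n^{k/2}r^k}$ 

$$|\hat\sigma_0^2 - \sigma_0^2| \leq v + r(r + 2C^{1/k}),$$

where $\sigma_0 = \sqrt{\mathrm{Var}[y_\bullet]}$. 

Proof. We have that $\hat\sigma_0^2 - \sigma_0^2 = E_n[y_\bullet^2 - E[y_\bullet^2]] - (E_n[y_\bullet])^2 + (E\,E_n[y_\bullet])^2$. 

Next note that by the Markov inequality and the Rosenthal inequality, for some constant 
$A(k/2)$ we have 

$$P(|E_n[y_\bullet^2 - E[y_\bullet^2]]| > v) \leq \frac{E\left|\sum_{i=1}^n (y_i^2 - E[y_i^2])\right|^{k/2}}{n^{k/2}v^{k/2}} \leq \frac{A(k/2)\max\left\{\sum_{i=1}^n E|y_i|^k, \left(\sum_{i=1}^n E|y_i|^4\right)^{k/4}\right\}}{n^{k/2}v^{k/2}} \leq \frac{A(k/2)\max\{nC, Cn^{k/4}\}}{n^{k/2}v^{k/2}} \leq \frac{A(k/2)C}{n^{k/4}v^{k/2}}.$$

Next note that $(E_n[y_\bullet])^2 - (E\,E_n[y_\bullet])^2 = (E_n[y_\bullet + E[y_\bullet]])(E_n[y_\bullet - E[y_\bullet]])$. Similarly, by the 
Markov inequality and the Rosenthal inequality, for some constant $A(k)$ we have $P(|E_n[y_\bullet - E[y_\bullet]]| > r) \leq \frac{A(k)C}{n^{k/2}r^k}$. Thus, 

$$P\left(|(E_n[y_\bullet])^2 - (E\,E_n[y_\bullet])^2| > r(r + 2C^{1/k})\right) \leq \frac{A(k)C}{n^{k/2}r^k}.$$

The result follows by choosing $K \geq A(k) \vee A(k/2)$. □ 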



Lemma 7. Suppose that Condition M holds and that $\lambda \geq c\,n\|S\|_\infty$ with probability at 
least $1 - \alpha$. Then, for any $\varepsilon, \gamma \in (0, 1)$ we have 



< 1 



2c s \Jl 



2c s 



a 2 n 2 K(l) 2 a 2 nn(l) a 2 



x/21ogl/7 + . 



QiP) 



> 1 



c| (2 + 4c) 
2 



A 2 s 



?t, 2 k(2c)k(c) tt,k(2c) 



2c s 



v/21ogl/ 7 -£ 



with probability $1 - \alpha - 2\exp(-n\varepsilon^2/12) - \gamma$. 
Proof. We start by 

_ Q(^)-E»[e 2 ] | E K [e 2 ] 
cr 2 cr 2 G 2 

To control the second term we invoke tail-bounds for the chi-square distribution, see for 
instance Lemma 4.1 in [1]. Indeed, for any e > we have 



^(E„[e 2 ] <a 2 (l-£))<exp 
P(E„[e 2 ] > <r 2 (l + £)) < cxp 



ne 
~2 



1 £ 

.2 ~ 3 

_£ 

2 3 



and 
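To bound the first term, we have 

$$\hat Q(\hat\beta) - E_n[\epsilon_\bullet^2] = \hat Q(\hat\beta) - \hat Q(\beta_0) + E_n[r_\bullet^2] + 2E_n[\epsilon_\bullet r_\bullet],$$

where $E_n[r_\bullet^2] = c_s^2$ and since $2E_n[\epsilon_\bullet r_\bullet] \sim N(0, 4\sigma^2 c_s^2/n)$ it follows that $2E_n[\epsilon_\bullet r_\bullet] \leq \sigma(2c_s/\sqrt{n})\sqrt{2\log(1/\gamma)}$ 
with probability $1 - \gamma$. 

Finally, we bound the term $\hat Q(\hat\beta) - \hat Q(\beta_0)$ from above and below. To bound above, we 
use the optimality of $\hat\beta$, so that $\hat Q(\hat\beta) - \hat Q(\beta_0) \leq \frac{\lambda}{n}(\|\delta_T\|_1 - \|\delta_{T^c}\|_1)$. If $\|\delta_T\|_1 \leq \|\delta_{T^c}\|_1$ 
we have $\hat Q(\hat\beta) - \hat Q(\beta_0) \leq 0$. Thus we can assume $\|\delta_{T^c}\|_1 \leq \|\delta_T\|_1$. Then, with probability 
at least $1 - \alpha$ we have $\lambda \geq c\,n\|S\|_\infty$ and by the definition of RE(1) and Theorem 1 we 
have 

$$\hat Q(\hat\beta) - \hat Q(\beta_0) \leq \frac{\lambda\sqrt{s}}{n\kappa(1)}\|\delta\|_{2,n} \leq \left(1 + \frac{1}{c}\right)\frac{\lambda^2 s}{n^2\kappa(1)^2} + \frac{2c_s\lambda\sqrt{s}}{n\kappa(1)}.$$

To bound from below note that by convexity 

$$\hat Q(\hat\beta) - \hat Q(\beta_0) \geq \|\delta\|_{2,n}^2 - \|S\|_\infty\|\delta\|_1 - 2c_s\|\delta\|_{2,n}.$$

It follows that $\|\delta\|_{2,n}^2 - 2c_s\|\delta\|_{2,n} \geq -c_s^2$. Next, we invoke the $\ell_1$-norm bound in Theorem 
1 so that 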



0(A) - 0(A)) > -<2 



A(2 + 4c) v/i / A^i 



+ c ; 



V 



(2 + 4c)c 



21 



cn k(2c) \tik(c) 
The result follows by simplifying the expression above. 

The result below verifies Condition V relying on Lemmas 6 and 7. 



□ 



Theorem 8. Assume that Condition M holds and for some $k > 4$ we have $E[|y_i|^k] \leq C$ 
uniformly in n. Then, for any $\varepsilon, \gamma \in (0, 1)$ we have that Condition V holds with 



K2 k/2 



= i-«-^(cK)(i-c/ c 'r fc/2 



K6 k 

n k/2 



(C 2 /a 2k ) ■ (1 - c/c'y k - 2eM-ne 2 /l2) - 7, 



u < 1 + 



2s (3c'<t A(1 - a\X)f 



c (l)= 



+ (3cVoA(l - a\X)) 



2c„-Js 



2c, 



x + -| + — p^logl^ + e, 



£> 



(2 + 4c) 



(3cV A(l - a\X)Y s , c 3 (3c'a A(l - a|X)) 2 
H 7^ 1- c s 



n 2 k,{2c)k(c) 



nn(2c) 



2c, 



V2 log 1/7- 



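Proof. By Lemma 6 with $v = \sigma_0^2 \cdot (1 - c/c')/2$ and $r = (\sigma_0^2/C^{1/k}) \cdot (1 - c/c')/6$, with 
probability at least $1 - \frac{K2^{k/2}}{n^{k/4}}(C/\sigma_0^k)(1 - c/c')^{-k/2} - \frac{K6^k}{n^{k/2}}(C^2/\sigma_0^{2k}) \cdot (1 - c/c')^{-k}$, we have 
$|\hat\sigma_0^2 - \sigma_0^2| \leq \sigma_0^2(1 - c/c')$ so that 

$$c/c' \leq \hat\sigma_0^2/\sigma_0^2 \leq 2 + c/c' \leq 3.$$

Since $\sigma \leq \sigma_0$, for $\lambda = c' \cdot \hat\sigma_0 \cdot \Lambda(1 - \alpha|X)$, we have $\lambda \geq c\,n\|S\|_\infty$ with probability at 
least $1 - \alpha - \frac{K2^{k/2}}{n^{k/4}}(C/\sigma_0^k)(1 - c/c')^{-k/2} - \frac{K6^k}{n^{k/2}}(C^2/\sigma_0^{2k}) \cdot (1 - c/c')^{-k}$. 

Thus, by Lemma 7, we have that Condition V holds with the stated bounds. □ 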
Under the typical design conditions 

$$\kappa(2\bar c) \gtrsim 1, \quad \alpha = o(1), \quad \text{and} \quad s\log(p/\alpha) = o(n), \qquad (A.33)$$

the bounds stated in Theorem 8 establish that $\ell \to 1$, $u \to 1$ and $\tau \to 0$ asymptotically. 
In finite samples, the following lemma ensures that $\ell > 0$. 

Lemma 8. We have that $\hat\sigma_0 > 0$ and $\hat\sigma = \sqrt{\hat Q(\hat\beta)} > 0$ with probability 1. 

Proof. First note that $\hat\sigma_0 = \sqrt{\mathrm{Var}_n[y_\bullet]} = 0$ only if $y_i = \bar y$ for every $i = 1, \ldots, n$. That 
is, $\epsilon_i = E_n[x_\bullet'\beta_0 + \epsilon_\bullet] - x_i'\beta_0$, which is a zero measure event. 

Next note that $\hat\sigma = \sqrt{\hat Q(\hat\beta)} = 0$ only if $y_i = x_i'\hat\beta$ for every $i = 1, \ldots, n$. By the 
optimality conditions we have $0 \in \nabla\hat Q(\hat\beta) + \frac{\lambda}{n}\partial\|\cdot\|_1(\hat\beta)$. Since $\nabla\hat Q(\hat\beta) = 0$, we have 
$0 \in \partial\|\cdot\|_1(\hat\beta)$ which implies that $\hat\beta = 0$. In turn $y_i = x_i'\hat\beta = 0$ for every $i = 1, \ldots, n$, which 
is a zero measure event since $y_i = x_i'\beta_0 + \epsilon_i$. □ 



A.3. Perfect Model Selection 



The following result on perfect model selection also requires strong assumptions on 
separation of coefficients and the empirical Gram matrix. Recall that for a scalar v, 
sign(w) = v/\v\ if |u| > 0, and otherwise. If v is a vector, we apply the definition 
componentwise. Also, given a vector x £ H p and a set T C {1, . . . ,p}, let us denote 
Xi[T] := {xij,j E T}. 

Lemma 9 (Cases with Perfect Model Selection by lasso). Suppose Condition M holds. 
We have perfect model selection for lasso, T = T, if and only if 



muijeT 



Aw 



[x. [T c ]x. [T]'] E„ [x. [T]x. [T] 1 ] 1 {e„ [x. [T]u.] 

-^sign(^o[T])} - E n [x.[T c ]u.} < ±, 

E„ [^.[^^.[T]']- 1 {E n [x.[T]u.] - ^sign(/3 [T])} 



> 0. 



The result follows immediately from the first order optimality conditions, see [31]. The
paper [34] provides further primitive sufficient conditions for perfect model selection for the
parametric case in which $u_i = \varepsilon_i$, and [5] provides some conditions for the nonparametric
case. The conditions above might typically require a slightly larger choice of $\lambda$ than
(2.12), and larger separation from zero of the minimal non-zero coefficient $\min_{j\in T}|\beta_{0j}|$.
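
As an illustration, the two conditions of Lemma 9 can be evaluated on a given sample. The sketch below is not part of the original analysis: it assumes the lasso criterion $E_n[(y_i - x_i'\beta)^2] + (\lambda/n)\|\beta\|_1$, a non-empty support $T$, and an invertible Gram matrix on $T$. The function name and argument layout are hypothetical.

```python
import numpy as np

def perfect_selection_conditions(X, u, beta0, lam):
    """Evaluate the two conditions of Lemma 9 on one sample (illustrative sketch).

    X: (n, p) design; u: (n,) disturbances; beta0: (p,) coefficients whose
    non-zero entries define T (assumed non-empty); lam: penalty level.
    Returns (dual_ok, nonzero_ok).
    """
    n, p = X.shape
    T = np.flatnonzero(beta0)
    Tc = np.setdiff1d(np.arange(p), T)
    XT, XTc = X[:, T], X[:, Tc]
    gram_TT = XT.T @ XT / n
    # h = E_n[x[T]x[T]']^{-1} { E_n[x[T]u] - (lam/2n) sign(beta0[T]) }
    h = np.linalg.solve(gram_TT, XT.T @ u / n - lam / (2 * n) * np.sign(beta0[T]))
    # First condition: sup-norm bound on the components outside T.
    off = (XTc.T @ XT / n) @ h - XTc.T @ u / n
    dual_ok = np.max(np.abs(off), initial=0.0) <= lam / (2 * n)
    # Second condition: no coefficient in T is moved exactly to zero.
    nonzero_ok = np.min(np.abs(beta0[T] + h)) > 0
    return bool(dual_ok), bool(nonzero_ok)
```

With an orthogonal design and no disturbance both conditions hold, while a disturbance strongly correlated with an excluded regressor violates the first one.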

Appendix B: Omitted Proofs

B.1. Section 2: Proof of Lemma 1

Proof of Lemma 1. We can assume that $m + s \ge 1$. Let $\hat\sigma_j^2 = E_n[\tilde x_{ij}^2]$ for $j = 1,\ldots,p$.
Moreover, let $c_*(m)$ and $c^*(m)$ denote the minimum and maximum $m$-sparse
eigenvalues associated with $E_n[\tilde x_i\tilde x_i']$ (unnormalized covariates). It follows that
$\phi(m) \le \max_{1\le j\le p}\hat\sigma_j^2\, c^*(m+s)$ and $\tilde\kappa(m)^2 \ge \min_{1\le j\le p}\hat\sigma_j^2\, c_*(m+s)$. These relations show that
for bounding $c_*(m+s)$ and $c^*(m+s)$ it suffices to bound $\phi(m)$, $\tilde\kappa(m)$, and deviations of the
$\hat\sigma_j$'s away from 1.

Note that $P(\max_{1\le j\le p}|\hat\sigma_j - 1| \le 1/4) \to 1$ as $n$ grows, since

\[ P\Big(\max_{1\le j\le p}|\hat\sigma_j - 1| > 1/4\Big) \le p\max_{1\le j\le p}P\big(|\hat\sigma_j^2 - 1| > 1/4\big) \le 2p\exp\big(-n^2/[32nK_n^2 + 8K_n^2 n/3]\big) \to 0 \]

by Bernstein's inequality (Lemma 2.2.9 in [30]), $\mathrm{Var}(\tilde x_{ij}^2) \le K_n^2$, and the side condition
$K_n^2\log p = o(n)$.

Under $s\log(n)\log^2(s\log n) \le n[\kappa/\varphi^{1/2}][\epsilon/K_n]^2/[(\log p)(\log n)]$ for some $\epsilon > 0$ small
enough, the bound on $\phi(m)$ and $\tilde\kappa(m)^2$ follows from the application of (a simple extension
of) results of Rudelson and Vershynin [25], namely Corollary 4 in Appendix C. □

B.2. Section 3: Proofs of Theorem 1 and Lemma 3

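Proof of the $\|\cdot\|_{2,n}$ bound in Theorem 1. Similar to [4], we make use of the following
relation: for $\delta = \hat\beta - \beta_0$, if $\lambda \ge cn\|S\|_\infty$,

\[ \hat Q(\hat\beta) - \hat Q(\beta_0) - \|\delta\|_{2,n}^2 = -2E_n[\varepsilon_\bullet x_\bullet'\delta] - 2E_n[r_\bullet x_\bullet'\delta] \ge -\|S\|_\infty\|\delta\|_1 - 2c_s\|\delta\|_{2,n} \ge -\frac{\lambda}{cn}\big(\|\delta_T\|_1 + \|\delta_{T^c}\|_1\big) - 2c_s\|\delta\|_{2,n}. \tag{B.34} \]

By definition of $\hat\beta$, $\hat Q(\hat\beta) - \hat Q(\beta_0) \le \frac{\lambda}{n}\|\beta_0\|_1 - \frac{\lambda}{n}\|\hat\beta\|_1$, which implies that

\[ -\frac{\lambda}{cn}\big(\|\delta_T\|_1 + \|\delta_{T^c}\|_1\big) + \|\delta\|_{2,n}^2 - 2c_s\|\delta\|_{2,n} \le \frac{\lambda}{n}\big(\|\delta_T\|_1 - \|\delta_{T^c}\|_1\big). \tag{B.35} \]

If $\|\delta\|_{2,n}^2 - 2c_s\|\delta\|_{2,n} < 0$, then we have established the bound in the statement of the
theorem. On the other hand, if $\|\delta\|_{2,n}^2 - 2c_s\|\delta\|_{2,n} \ge 0$ we get, for $\bar c = (c+1)/(c-1)$,

\[ \|\delta_{T^c}\|_1 \le \bar c\,\|\delta_T\|_1, \tag{B.36} \]

and therefore $\delta$ belongs to the restricted set in condition RE($\bar c$). From (B.35) and using
RE($\bar c$) we get

\[ \|\delta\|_{2,n}^2 - 2c_s\|\delta\|_{2,n} \le \Big(1 + \frac{1}{c}\Big)\frac{\lambda}{n}\|\delta_T\|_1 \le \Big(1 + \frac{1}{c}\Big)\frac{\lambda\sqrt{s}}{n}\frac{\|\delta\|_{2,n}}{\kappa(\bar c)}, \]

which gives the result on the prediction norm. □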
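
Proof of Lemma 3. Let $W := E_n[x_ix_i']$ and $\bar\alpha \in \mathbb{R}^p$ be such that $\phi(\lceil\ell k\rceil) = \bar\alpha'W\bar\alpha$
and $\|\bar\alpha\| = 1$. We can decompose

\[ \bar\alpha = \sum_{i=1}^{\lceil\ell\rceil}\alpha_i, \quad \text{with} \quad \sum_{i=1}^{\lceil\ell\rceil}\|\alpha_{iT^c}\|_0 = \|\bar\alpha_{T^c}\|_0 \quad \text{and} \quad \alpha_{iT} = \bar\alpha_T/\lceil\ell\rceil, \]

where we can choose the $\alpha_i$'s such that $\|\alpha_{iT^c}\|_0 \le k$ for each $i = 1,\ldots,\lceil\ell\rceil$, since $\lceil\ell\rceil k \ge \lceil\ell k\rceil$.
Note that the vectors $\alpha_i$ have no overlapping support outside $T$. Since $W$ is positive
semi-definite, $\alpha_i'W\alpha_i + \alpha_j'W\alpha_j \ge 2|\alpha_i'W\alpha_j|$ for any pair $(i,j)$. Therefore

\[ \phi(\lceil\ell k\rceil) = \bar\alpha'W\bar\alpha = \sum_{i=1}^{\lceil\ell\rceil}\sum_{j=1}^{\lceil\ell\rceil}\alpha_i'W\alpha_j \le \lceil\ell\rceil\sum_{i=1}^{\lceil\ell\rceil}\alpha_i'W\alpha_i \le \lceil\ell\rceil\sum_{i=1}^{\lceil\ell\rceil}\|\alpha_i\|^2\phi(\|\alpha_{iT^c}\|_0) \le \lceil\ell\rceil\max_{i=1,\ldots,\lceil\ell\rceil}\phi(\|\alpha_{iT^c}\|_0) \le \lceil\ell\rceil\phi(k), \]

where we used that

\[ \sum_{i=1}^{\lceil\ell\rceil}\|\alpha_i\|^2 = \sum_{i=1}^{\lceil\ell\rceil}\big(\|\alpha_{iT}\|^2 + \|\alpha_{iT^c}\|^2\big) = \frac{\|\bar\alpha_T\|^2}{\lceil\ell\rceil} + \sum_{i=1}^{\lceil\ell\rceil}\|\alpha_{iT^c}\|^2 \le \|\bar\alpha\|^2 = 1. \]

□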
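
The sub-linearity relation $\phi(\lceil\ell k\rceil) \le \lceil\ell\rceil\phi(k)$ can be checked by brute force on a small design. The sketch below is illustrative only: it assumes that $\phi(m)$ is the largest value of $\alpha'W\alpha$ over unit vectors with at most $m$ non-zero components outside $T$, as the proof suggests, and the function names are hypothetical. Enumeration over subsets is feasible only for small $p$.

```python
import itertools
import math
import numpy as np

def phi(W, T, m):
    """Largest eigenvalue of W over supports T plus at most m indices outside T.

    Assumes T is non-empty or m >= 1, so every enumerated support is non-empty.
    """
    p = W.shape[0]
    Tc = [j for j in range(p) if j not in T]
    m = min(m, len(Tc))
    best = 0.0
    # The largest eigenvalue is monotone in the support, so subsets of
    # size exactly m are enough.
    for extra in itertools.combinations(Tc, m):
        idx = list(T) + list(extra)
        best = max(best, float(np.linalg.eigvalsh(W[np.ix_(idx, idx)])[-1]))
    return best

def sublinearity_sides(W, T, k, ell):
    """Return (phi(ceil(ell*k)), ceil(ell) * phi(k)) for comparison."""
    return phi(W, T, math.ceil(ell * k)), math.ceil(ell) * phi(W, T, k)
```

For a random Gram matrix the first returned value should never exceed the second, up to floating-point error.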



B.3. Section 4: Relation after (A.30) in Proof of Lemma 5



Proof of Lemma 5: Relation after (A.30). First note that $\bar\Phi(t) \le \exp(-t^2/2)$ for
$t \ge 1$. Then,



D a fj,(m) e n (m ,77) 



m-\-s 



$(e n (m,rj)/o) 



< exp f- e "2^ ,??) + ( m + s ) lo S 



e n { rn -, r )) 



cxp 



f (ro+s) 


e ra (m,?7) 


I 2 





+ (m + s) log 



+ (m + s) log(L>er^(m))^ 

(m + s) log(Dafi(m)) 



/m+sa 



Next note that $\log x \le x^2/4$ if $x \ge 2\sqrt{2}$. Note that $e_n(m,\eta)/[\sqrt{m+s}\,\sigma] \ge 2\sqrt{2}$ since
$\mu(m) \ge 1$ and we can take $D \ge e$. Thus, the expression above is bounded by



I < exp 



exp 



(m + s) 



\Jm + sa 



(m + s) \og(Da/j,(m)) 



e 2 n {m,ri) 



< exp — log 



4cr 2 
m 



+ (m + s) \og(Da/j,(m)) 
- (m + s) log(l/r?) 



□ 

B.4. Section 5: Proofs of Theorems 6 and 7

In this section we provide the proofs of Theorems 6 and 7. We begin with Theorem 6,
in which the threshold level is set based on the fit of the second-step estimator relative to the
fit of the original estimator, in this case lasso.

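Proof of Theorem 6. Let $\tilde B_n := \hat Q(\tilde\beta) - \hat Q(\beta_0)$ and $\tilde C_n := \hat Q(\beta_{0\tilde T}) - \hat Q(\beta_0)$. It follows
by definition of the estimator that $\tilde B_n \le \gamma + \hat Q(\hat\beta) - \hat Q(\beta_0)$. Thus, by Theorem 4, for any
$\varepsilon > 0$, there is a constant $K_\varepsilon$ independent of $n$ such that with probability at least $1 - \varepsilon$
we have

\[ \|\tilde\beta - \beta_0\|_{2,n} \le K_\varepsilon\sigma\sqrt{\frac{\tilde m\log p + (\tilde m + s)\log(e\mu(\tilde m))}{n}} + 2c_s + \sqrt{(\tilde B_n)_+ \wedge (\tilde C_n)_+}, \]

\[ (\tilde B_n)_+ \le \gamma + \hat Q(\hat\beta) - \hat Q(\beta_0), \]

\[ (\tilde B_n)_+ \wedge (\tilde C_n)_+ \le 1\{T \not\subseteq \tilde T\}(\tilde B_n)_+, \]

since $\tilde C_n = 0$ if $T \subseteq \tilde T$.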

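We bound $B_n = \hat Q(\hat\beta) - \hat Q(\beta_0)$ as in Theorem 5, namely,

\[ B_n \le \frac{\lambda\sqrt{s}}{n\kappa(1)}\|\hat\beta - \beta_0\|_{2,n} \le \Big(1 + \frac{1}{c}\Big)\frac{\lambda^2 s}{n^2\kappa(1)^2} + 2c_s\frac{\lambda\sqrt{s}}{n\kappa(1)}. \]

The second claim follows from the first by using $\lambda \lesssim \sqrt{n\log p}$ under Condition V and the
specified conditions on the penalty level. The final bound follows by applying the relation
that for any nonnegative numbers $a$, $b$, we have $\sqrt{ab} \le (a + b)/2$. □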

The traditional thresholding scheme truncates to zero all components below a
set threshold $t$. This is arguably the most used thresholding scheme in the literature.
Recall that $\hat\beta_{tj} = \hat\beta_j 1\{|\hat\beta_j| > t\}$, $\tilde m := |\tilde T \setminus T|$, $m_t := |\hat T \setminus \tilde T|$ and $\gamma_t := \|\hat\beta_t - \hat\beta\|_{2,n}$, where
$\hat\beta$ is the lasso estimator.

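Proof of Theorem 7. Let $\tilde B_n := \hat Q(\hat\beta_t) - \hat Q(\beta_0)$ and $\tilde C_n := \hat Q(\beta_{0\tilde T}) - \hat Q(\beta_0)$.

By Theorem 4 and Lemma 4, for any $\varepsilon > 0$, there is a constant $K_\varepsilon$ independent of $n$
such that with probability at least $1 - \varepsilon$ we have

\[ \|\tilde\beta - \beta_0\|_{2,n} \le K_\varepsilon\sigma\sqrt{\frac{\tilde m\log p + (\tilde m + s)\log(e\mu(\tilde m))}{n}} + 2c_s + \sqrt{(\tilde B_n)_+ \wedge (\tilde C_n)_+}, \]

\[ (\tilde B_n)_+ \le \|\hat\beta_t - \beta_0\|_{2,n}^2 + \cdots, \]

\[ (\tilde B_n)_+ \wedge (\tilde C_n)_+ \le 1\{T \not\subseteq \tilde T\}(\tilde B_n)_+, \]

since $\tilde C_n = 0$ if $T \subseteq \tilde T$.

Next note that by definition of $\gamma_t$, we have $\|\hat\beta_t - \beta_0\|_{2,n} \le \gamma_t + \|\hat\beta - \beta_0\|_{2,n}$. The result
follows by applying Theorem 1 to bound $\|\hat\beta - \beta_0\|_{2,n}$.

The second claim follows from the first by using $\lambda \lesssim \sqrt{n\log p}$ under Condition V,
the specified conditions on the penalty level, and the relation that for any nonnegative
numbers $a$, $b$, we have $\sqrt{ab} \le (a + b)/2$. □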



Appendix C: Uniform Control of Sparse Eigenvalues 

In this section we provide a simple extension of the sparse law of large numbers for 
matrices derived in [25] to the case where the population matrices are non-isotropic. 

Lemma 10 (Essentially in [25] Lemma 3.8). Let $x_1,\ldots,x_n$ be vectors in $\mathbb{R}^p$ with
uniformly bounded entries, $\|x_i\|_\infty \le K$ for all $i = 1,\ldots,n$. Then, for independent
Rademacher random variables $\varepsilon_i$, $i = 1,\ldots,n$, we have

\[ E\Big[\sup_{\|\alpha\|_0\le k,\|\alpha\|=1}\Big|\sum_{i=1}^n\varepsilon_i(\alpha'x_i)^2\Big|\Big] \le \Big(CK\sqrt{k}\log(k)\sqrt{\log(p\vee n)}\sqrt{\log n}\Big)\sup_{\|\alpha\|_0\le k,\|\alpha\|=1}\Big(\sum_{i=1}^n(\alpha'x_i)^2\Big)^{1/2}, \]

where $C$ is a universal constant.

Proof. The proof follows from Rudelson and Vershynin [25] Lemma 3.8, setting $A =
K/\sqrt{k}$ instead of $A = 1/\sqrt{k}$ so that the constant $C(K)$ can be taken to be $C\cdot K$. □

Lemma 11 (Essentially in [25] Theorem 3.6). Let $x_i$, $i = 1,\ldots,n$, be i.i.d. random
vectors in $\mathbb{R}^p$ with uniformly bounded entries, $\|x_i\|_\infty \le K$ a.s. for all $i = 1,\ldots,n$. Let
$\delta_n := 2\big(CK\sqrt{k}\log(k)\sqrt{\log(p\vee n)}\sqrt{\log n}\big)/\sqrt{n}$, where $C$ is the universal constant in
Lemma 10. Then,

\[ E\Big[\sup_{\|\alpha\|_0\le k,\|\alpha\|=1}\Big|E_n\big[(\alpha'x_i)^2 - E[(\alpha'x_i)^2]\big]\Big|\Big] \le \delta_n^2 + \delta_n\sup_{\|\alpha\|_0\le k,\|\alpha\|=1}\sqrt{E[(\alpha'x_i)^2]}. \]

Proof. Let

\[ V_k = \sup_{\|\alpha\|_0\le k,\|\alpha\|=1}\Big|E_n\big[(\alpha'x_i)^2 - E[(\alpha'x_i)^2]\big]\Big|. \]

Then, by a standard symmetrization argument (see Guédon and Rudelson [16], page
804),

\[ nE[V_k] \le 2E_xE_\varepsilon\Big[\sup_{\|\alpha\|_0\le k,\|\alpha\|=1}\Big|\sum_{i=1}^n\varepsilon_i(\alpha'x_i)^2\Big|\Big]. \]

Letting

\[ \phi(k) = \sup_{\|\alpha\|_0\le k,\|\alpha\|\le 1}E_n[(\alpha'x_i)^2] \quad \text{and} \quad \varphi(k) = \sup_{\|\alpha\|_0\le k,\|\alpha\|=1}E[(\alpha'x_i)^2], \]

we have $\phi(k) \le \varphi(k) + V_k$ and by Lemma 10

\[ nE[V_k] \le 2\Big(CK\sqrt{k}\log(k)\sqrt{\log(p\vee n)}\sqrt{\log n}\Big)\sqrt{n}\,E_x\Big[\sqrt{\phi(k)}\Big] \le 2\Big(CK\sqrt{k}\log(k)\sqrt{\log(p\vee n)}\sqrt{\log n}\Big)\sqrt{n}\sqrt{\varphi(k) + E[V_k]}. \]

The result follows by noting that for positive numbers $v$, $A$, $B$, $v \le A(v + B)^{1/2}$ implies
$v \le A^2 + A\sqrt{B}$. □
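
The elementary implication used in the last step can be verified directly: the largest $v$ with $v \le A(v+B)^{1/2}$ is the positive root of $v^2 = A^2(v + B)$, and this root never exceeds $A^2 + A\sqrt{B}$. The short sketch below, with hypothetical function names, computes both quantities so they can be compared numerically.

```python
import math

def largest_admissible_v(A, B):
    """Positive root of v^2 = A^2 (v + B), the largest v with v <= A*sqrt(v + B)."""
    return (A * A + math.sqrt(A ** 4 + 4.0 * A * A * B)) / 2.0

def claimed_bound(A, B):
    """The bound A^2 + A*sqrt(B) stated in the proof."""
    return A * A + A * math.sqrt(B)
```

For instance, with $A = 2$ and $B = 3$ the root is 6 while the bound is $4 + 2\sqrt{3} \approx 7.46$.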

Corollary 4. Suppose $x_i$, $i = 1,\ldots,n$, are i.i.d. vectors, such that the population design
matrix $E[x_ix_i']$ has its $k$-sparse eigenvalues bounded from above by $\varphi < \infty$ and bounded
from below by $\kappa^2 > 0$. If the $x_i$ are arbitrary with $\max_{1\le i\le n}\|x_i\|_\infty \le K_n$ a.s., and the
condition $K_n^2k\log^2(k)\log(n)\log(p\vee n) = o(n\kappa^4/\varphi)$ holds, then

\[ P\Big(\sup_{\|\alpha\|_0\le k,\|\alpha\|=1}E_n[(\alpha'x_i)^2] \le 2\varphi, \ \inf_{\|\alpha\|_0\le k,\|\alpha\|=1}E_n[(\alpha'x_i)^2] \ge \kappa^2/2\Big) = 1 - o(1). \]

Proof. Let $V_k = \sup_{\|\alpha\|_0\le k,\|\alpha\|=1}\big|E_n[(\alpha'x_i)^2 - E[(\alpha'x_i)^2]]\big|$. It suffices to prove that
$P(V_k > \kappa^2/2) = o(1)$. Indeed,

\[ \sup_{\|\alpha\|_0\le k,\|\alpha\|=1}E_n[(\alpha'x_i)^2] \le V_k + \varphi \quad \text{and} \quad \inf_{\|\alpha\|_0\le k,\|\alpha\|=1}E_n[(\alpha'x_i)^2] \ge \kappa^2 - V_k. \]

By Markov's inequality, $P(V_k > \kappa^2/2) \le 2E[V_k]/\kappa^2$ and the result follows provided
that $E[V_k] = o(\kappa^2)$.

For $\delta_n := 2\big(CK_n\sqrt{k}\log(k)\sqrt{\log(p\vee n)}\sqrt{\log n}\big)/\sqrt{n}$, by Lemma 11, we have $E[V_k] \le
\delta_n^2 + \delta_n\sqrt{\varphi} = o(\kappa^2)$ by the growth condition in the statement. □



Appendix D: Empirical Performance Relative to lasso 

In this section we assess the finite sample performance of the following estimators: 1)
lasso, which is our benchmark, 2) ols post lasso, 3) ols post fit-lasso, and 4) ols post
t-lasso with the threshold $t = \lambda/n$. We consider a "parametric" and a "nonparametric"
model of the form:

\[ y_i = f_i + \varepsilon_i, \quad f_i = z_i'\theta_0, \quad \varepsilon_i \sim N(0,\sigma^2), \quad i = 1,\ldots,n, \]

where in the "parametric" model

\[ \theta_0 = C\cdot[1,1,1,1,1,0,0,\ldots,0]', \tag{D.37} \]

and in the "nonparametric" model

\[ \theta_0 = C\cdot[1,1/2,1/3,\ldots,1/p]'. \tag{D.38} \]

The latter model is called "nonparametric" because in that model the
function $f(z) = \sum_{j=1}^{p}z_j\theta_{0j}$ is numerically indistinguishable from the function
$g(z) = \sum_{j=1}^{\infty}z_j\gamma_{j0}$ characterized by the infinite-dimensional parameter $\gamma_j$ with true values
$\gamma_{j0} = 1/j$.

The parameter $C$ determines the size of the coefficients, representing the "strength of
the signal", and we vary $C$ between 0 and 2. The number of regressors is $p = 500$, the
sample size is $n = 100$, the variance of the noise is $\sigma^2 = 1$, and we used 1000 simulations
for each design. We generate regressors from the normal law $z_i \sim N(0,\Sigma)$, and consider
three designs of the covariance matrix $\Sigma$: a) the isotropic design with $\Sigma_{jk} = 0$ for $j \ne k$,
b) the Toeplitz design with $\Sigma_{jk} = (1/2)^{|j-k|}$, and c) the equi-correlated design with
$\Sigma_{jk} = 1/2$ for $j \ne k$; in all designs $\Sigma_{jj} = 1$. Thus our parametric model is very sparse and
offers a rather favorable setting for applying lasso-type methods, while our nonparametric
model is non-sparse and much less favorable.
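
The simulation designs described above can be generated as follows. This is a minimal sketch, not the code used for the reported experiments: the function names, the string labels for the designs and models, and the use of a Cholesky factor to draw correlated regressors are all assumptions made for illustration.

```python
import numpy as np

def covariance(p, design, rho=0.5):
    """Covariance matrix of the regressors for designs a), b) and c)."""
    if design == "isotropic":
        return np.eye(p)
    if design == "toeplitz":
        idx = np.arange(p)
        return rho ** np.abs(idx[:, None] - idx[None, :])
    if design == "equicorrelated":
        return np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
    raise ValueError(f"unknown design: {design}")

def theta0(p, C, model):
    """Coefficient vector of (D.37) or (D.38), scaled by the signal strength C."""
    if model == "parametric":
        theta = np.zeros(p)
        theta[:5] = 1.0
    elif model == "nonparametric":
        theta = 1.0 / np.arange(1, p + 1)
    else:
        raise ValueError(f"unknown model: {model}")
    return C * theta

def simulate(n, p, C, model, design, sigma=1.0, rng=None):
    """Draw one sample (Z, y, f) with y_i = f_i + eps_i and f_i = z_i' theta0."""
    rng = np.random.default_rng(rng)
    chol = np.linalg.cholesky(covariance(p, design))
    Z = rng.standard_normal((n, p)) @ chol.T  # rows are N(0, Sigma)
    f = Z @ theta0(p, C, model)
    y = f + sigma * rng.standard_normal(n)
    return Z, y, f
```

A call such as `simulate(100, 500, 1.0, "parametric", "toeplitz", rng=0)` would produce one replication with the dimensions used in this appendix.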

We present the results of computational experiments for each design a)-c) in Figures 
2-4. The left column of each figure reports the results for the parametric model, and 
the right column of each figure reports the results for the nonparametric model. For each 
model the figures plot the following as a function of the signal strength for each estimator
$\tilde\beta$:

• in the top panel, the number of regressors selected, $E[|\tilde T|]$,

• in the middle panel, the norm of the bias, namely $\|E[\tilde\beta - \theta_0]\|$,

• in the bottom panel, the average empirical risk, namely $E[E_n[(f_i - z_i'\tilde\beta)^2]]$.
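
The three reported quantities can be approximated by averaging over simulation replications. The sketch below assumes that the estimates from $R$ replications are stacked in an $R \times p$ array and that the corresponding design matrices are kept in a list. The function name and this layout are illustrative choices, not the authors' code.

```python
import numpy as np

def summarize(estimates, designs, theta0):
    """Monte Carlo versions of the three plotted quantities.

    estimates: (R, p) array, one estimate per replication;
    designs: list of R design matrices of shape (n, p);
    theta0: (p,) true coefficient vector.
    Returns (mean number of selected regressors, norm of the bias,
    average empirical risk).
    """
    estimates = np.asarray(estimates, dtype=float)
    n_selected = float(np.count_nonzero(estimates, axis=1).mean())
    bias_norm = float(np.linalg.norm(estimates.mean(axis=0) - theta0))
    # Empirical risk E_n[(f_i - z_i' beta)^2] with f_i = z_i' theta0.
    risks = [np.mean((Z @ (theta0 - b)) ** 2) for Z, b in zip(designs, estimates)]
    return n_selected, bias_norm, float(np.mean(risks))
```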

We will focus the discussion on the isotropic design, and only highlight differences for 
other designs. 

Figure 2, left panel, shows the results for the parametric model with the isotropic 
design. We see from the bottom panel that, for a wide range of signal strength C, both 
ols post lasso and ols post fit-lasso significantly outperform both lasso and ols post t-lasso 
in terms of empirical risk. The middle panel shows that the first two estimators' superior 
performance stems from their much smaller bias. We see from the top panel that lasso
achieves good sparsity, ensuring that ols post lasso performs well, but ols post fit-lasso 
achieves even better sparsity. Under very high signal strength, ols post fit-lasso achieves 
the performance of the oracle estimator; ols post t-lasso also achieves this performance; ols 
post lasso nearly matches it; while lasso does not match this performance. Interestingly, 
the ols post t-lasso performs very poorly for intermediate ranges of signal. 

Figure 2, right panel, shows the results for the nonparametric model with the isotropic 
design. We see from the bottom panel that, as in the parametric model, both ols post 
lasso and ols post fit-lasso significantly outperform both lasso and ols post t-lasso in
terms of empirical risk. As in the parametric model, the middle panel shows that the first 
two estimators are able to outperform the last two because they have a much smaller 
bias. We also see from the top panel that, as in the parametric model, lasso achieves good 
sparsity, while ols post fit-lasso achieves excellent sparsity. In contrast to the parametric 
model, in the nonparametric setting the ols post t-lasso performs poorly in terms of
empirical risk for almost all signals, except for very weak signals. Also in contrast to the 
parametric model, no estimator achieves the exact oracle performance, although lasso, 
and especially ols post lasso and ols post fit-lasso perform nearly as well, as we would 
expect from the theoretical results. 

Figure 3 shows the results for the parametric and nonparametric model with the 
Toeplitz design. This design deviates only moderately from the isotropic design, and
we see that all of the previous findings continue to hold. Figure 4 shows the results 
under the equi-correlated design. This design strongly deviates from the isotropic design, 
but we still see that the previous findings continue to hold with only a few differences. 
Specifically, we see from the top panels that in this case lasso no longer selects very 
sparse models, while ols post fit-lasso continues to perform well and selects very sparse 
models. Consequently, in the case of the parametric model, ols post fit-lasso substantially 
outperforms ols post lasso in terms of empirical risk, as the bottom-left panel shows. In 
contrast, we see from the bottom right panel that in the nonparametric model, ols post 
fit-lasso performs as well as ols post lasso in terms of empirical risk, despite the
fact that it uses a much sparser model for estimation. 

The findings above confirm our theoretical results on post-model selection estimators 
in parametric and nonparametric models. Indeed, we see that ols post fit-lasso and ols 
post lasso are at least as good as lasso, and often perform considerably better since they 
remove penalization bias; ols post fit-lasso outperforms ols post lasso whenever lasso
does not produce excellent sparsity. Moreover, when the signal is strong and the model
is parametric and sparse (or very close to being such), the lasso-based model selection
permits the selection of the oracle or a near-oracle model. That allows for post-model selection
estimators to achieve improvements in empirical risk over lasso. Of particular note is the
excellent performance of ols post fit-lasso, which uses a data-driven threshold to select a
sparse model. This performance is fully consistent with our theoretical results. Finally, 
traditional thresholding performs poorly for intermediate ranges of signal. In particular, 
it exhibits very large biases leading to large goodness-of-fit losses. 
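
To make the comparison concrete, the sketch below implements lasso, ols post lasso and ols post t-lasso on a single sample. It is an illustration under stated assumptions rather than the implementation behind the figures: lasso is taken to minimize $E_n[(y_i - x_i'\beta)^2] + (\lambda/n)\|\beta\|_1$ and is solved by plain coordinate descent with a fixed number of sweeps, and all function names are hypothetical. The fit-lasso variant is omitted because its goodness-of-fit threshold is defined elsewhere in the paper.

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate descent for E_n[(y - x'b)^2] + (lam/n) * ||b||_1."""
    X = np.asarray(X, dtype=float)
    resid = np.asarray(y, dtype=float).copy()  # residual y - X b, with b = 0
    b = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual.
            c = X[:, j] @ resid + col_sq[j] * b[j]
            new = np.sign(c) * max(abs(c) - lam / 2.0, 0.0) / col_sq[j]
            resid += X[:, j] * (b[j] - new)
            b[j] = new
    return b

def ols_on_support(X, y, support):
    """Least squares restricted to the given columns; zeros elsewhere."""
    b = np.zeros(X.shape[1])
    if support.size:
        b[support] = np.linalg.lstsq(X[:, support], y, rcond=None)[0]
    return b

def ols_post_lasso(X, y, lam):
    """Return (lasso estimate, ols refit on the lasso support)."""
    b_lasso = lasso_cd(X, y, lam)
    return b_lasso, ols_on_support(X, y, np.flatnonzero(b_lasso))

def ols_post_t_lasso(X, y, lam, t):
    """Ols refit on the components of lasso exceeding the threshold t."""
    b_lasso = lasso_cd(X, y, lam)
    return ols_on_support(X, y, np.flatnonzero(np.abs(b_lasso) > t))
```

On an orthogonal design the lasso coefficient is shrunk toward zero by the penalty, while the refit recovers the unpenalized least squares value on the selected component, which is the bias removal discussed above.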

Figure 2. This figure plots the performance of the estimators listed in the text under the isotropic
design for the covariates, $\Sigma_{jk} = 0$ if $j \ne k$. The left column corresponds to the parametric case and the
right column corresponds to the nonparametric case described in the text. The number of regressors is 
p = 500 and the sample size is n = 100 with 1000 simulations for each value of C. 

Figure 3. This figure plots the performance of the estimators listed in the text under the Toeplitz design
for the covariates, $\Sigma_{jk} = \rho^{|j-k|}$ if $j \ne k$. The left column corresponds to the parametric case and the
right column corresponds to the nonparametric case described in the text. The number of regressors is 
p = 500 and the sample size is n = 100 with 1000 simulations for each value of C. 

Figure 4. This figure plots the performance of the estimators listed in the text under the equi-correlated
design for the covariates, $\Sigma_{jk} = \rho$ if $j \ne k$. The left column corresponds to the parametric case and the
right column corresponds to the nonparametric case described in the text. The number of regressors is 
p = 500 and the sample size is n = 100 with 1000 simulations for each value of C. 